
OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset
Abstract
Argument mining plays a pivotal role in developing advanced language models capable of sophisticated reasoning and understanding. This paper, published at NeurIPS 2024 (Track on Datasets and Benchmarks), introduces OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. The dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence available. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. The dataset is publicly available on HuggingFace to support further research and innovation in computational argumentation.
Introduction and Motivation
Argument mining plays a pivotal role in developing advanced large language models (LLMs) capable of sophisticated reasoning and understanding. Engaging with complex argumentative texts enhances LLMs' abilities to comprehend, generate, and evaluate arguments, improving their performance in applications such as legal document analysis, educational tools, and more. Existing argument mining datasets, such as DebateSum with 240,566 examples, are limited in scope, primarily focusing on pre-season evidence from summer camps and excluding the rich argumentative structures in regular-season debates. This limitation affects dataset size, representativeness, and utility for large-scale argument mining. To address these gaps, the authors introduce OpenDebateEvidence, a large-scale dataset sourced from the OpenCaseList project, comprising 3.5 million documents. The paper demonstrates that training LLMs on OpenDebateEvidence significantly improves their performance not only on this dataset but also on other related argumentative datasets, using state-of-the-art models (LLaMA3-8B and Mistral-7B) fine-tuned with advanced techniques such as LoRA (Low-Rank Adaptation), ReFT (Representation Fine-Tuning), and Orthogonalization.
Data Collection and Preprocessing
OpenDebateEvidence is sourced from the OpenCaseList project, an online platform where high school and college debate teams disclose and open-source their evidence. The dataset contains over 3.5 million documents, covering all NSDA debate topics from 2014 to 2022. Each document corresponds to a single piece of evidence used in a debate, categorized by debate format (Policy, Lincoln-Douglas, Public Forum), and includes comprehensive metadata such as author, date, title, source, citation details, and the debate round in which it was used. Debate evidence is stored in .docx format, requiring a specialized parsing pipeline. The process begins by unzipping .docx files to access internal XML files, which are parsed to extract formatting details such as underlining, bold, and highlighting. Each document then undergoes tokenization, simplification, and structuring into individual debate cards with metadata and content. A robust deduplication algorithm splits each card's text into sentences, preprocesses them, and links cards that share sentences. These links form duplicate clusters, and a representative card is selected from each cluster based on sentence count and content quality.
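The deduplication step described above can be sketched as follows. This is a minimal illustration of the general approach (sentence splitting, linking cards by shared-sentence overlap, clustering, and picking a representative), not the authors' actual implementation; the regex splitter, the length filter, and the `min_shared` threshold are all assumptions made for the sketch.

```python
import re
from collections import defaultdict

def sentences(text):
    """Split a card's text into normalized sentences (illustrative preprocessing)."""
    parts = re.split(r"(?<=[.!?])\s+", text.lower().strip())
    return {p.strip() for p in parts if len(p.split()) > 3}

def cluster_duplicates(cards, min_shared=2):
    """Group card IDs that share at least `min_shared` sentences.

    `cards` maps card_id -> full text. Returns a list of clusters
    (sets of card IDs); singleton clusters are unique cards.
    """
    card_sents = {cid: sentences(text) for cid, text in cards.items()}
    sent_to_ids = defaultdict(set)
    for cid, sents in card_sents.items():
        for s in sents:
            sent_to_ids[s].add(cid)

    # Union-find over cards linked by shared sentences.
    parent = {cid: cid for cid in cards}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    overlap = defaultdict(int)
    for ids in sent_to_ids.values():
        ids = sorted(ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                overlap[(ids[i], ids[j])] += 1
    for (a, b), n in overlap.items():
        if n >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for cid in cards:
        clusters[find(cid)].add(cid)
    return list(clusters.values())

def pick_representative(cluster, cards):
    """Keep the card with the most sentences, a simple proxy for content quality."""
    return max(cluster, key=lambda cid: len(sentences(cards[cid])))
```

The shared-sentence criterion is what makes this robust to debaters re-cutting the same source: two cards with different tags and highlighting still cluster together if their underlying text overlaps.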
Dataset Structure and Debate Formats
The dataset covers three prominent competitive debate styles. Policy Debate (also known as "cross-examination" or CX) involves two teams arguing for and against a specific policy proposal, with eight speeches and cross-examination periods; rounds last about 90 minutes. It constitutes approximately two-thirds of the dataset. Lincoln-Douglas Debate is a one-on-one format emphasizing ethical and moral reasoning, with a topic that changes every two months; it comprises about one-third of the dataset. Public Forum Debate is a two-on-two format on monthly topics designed for broader audiences, constituting a smaller portion. Each evidence document is organized with a "hat" (broad argument category), "pocket" (speech section), and "tag" (biased abstractive summary). This hierarchical metadata encodes the rhetorical structure and purpose of the evidence, with "hats" and "pockets" providing overarching structure while "tags" summarize key points. For argument mining, this metadata offers valuable semantic annotations for training models on argument components and relations.
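The pocket/hat/tag hierarchy can be pictured as a nested record. The field names and values below are illustrative stand-ins, not the dataset's actual column names or contents:

```python
# Hypothetical sketch of one evidence document's annotation hierarchy;
# field names and values are illustrative, not the dataset's real schema.
card = {
    "pocket": "Case",                    # speech section the card belongs to
    "hat": "Advantage 1 - Economy",      # broad argument category
    "tag": "Carbon pricing spurs green investment and job growth.",  # abstractive summary
    "cite": {"author": "Doe", "year": 2021, "source": "Journal of Policy"},
    "full_text": "...",
}

def argument_path(card):
    """Return the card's position in the rhetorical hierarchy, from
    coarsest (pocket) to finest (tag)."""
    return f'{card["pocket"]} > {card["hat"]} > {card["tag"]}'
```

Viewed this way, a tag is effectively a human-written abstractive summary label for its card, which is what makes the hierarchy directly usable as supervision for summarization and argument-component models.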
Dataset Statistics and Rich Metadata
The dataset offers a comprehensive collection of over 3.5 million documents categorized by debate format (Policy, Lincoln-Douglas, and Public Forum), each enriched with extensive metadata. Key statistics: 4,830,561 total rows, of which 3,512,280 documents contain valid full text; by format, the rows comprise 2,768,419 Policy Debate, 1,526,383 Lincoln-Douglas Debate, and 43,131 Public Forum Debate evidence documents. The dataset spans topics from 2014 to 2022 and represents 1,366 schools and 6,455 unique authors across 68 unique debate topics, with 45 metadata features per document. It also includes standardized tags describing argument types (topicality, disadvantages, advantages, counterplans) and token-level extractive summaries formed by the hierarchical formatting (underlined, bolded, and highlighted text) that debaters use to mark crucial portions of evidence for oral presentation.
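The token-level extractive summaries derive from nested formatting: debaters typically highlight a subset of what they underline, so each formatting tier yields a progressively shorter summary. The sketch below assumes a simplified `(word, marks)` token encoding as a stand-in for the dataset's formatting columns:

```python
def extractive_summary(tokens, level="underlined"):
    """Build a token-level extractive summary from formatting marks.

    `tokens` is a list of (word, marks) pairs, where marks is a set drawn
    from {"underlined", "bold", "highlighted"}. This tuple encoding is an
    illustrative assumption, not the dataset's actual representation.
    Because highlighting is usually nested inside underlining,
    level="highlighted" yields a shorter summary than level="underlined".
    """
    return " ".join(word for word, marks in tokens if level in marks)

# Example card fragment with tiered formatting.
tokens = [
    ("Rising", {"underlined"}),
    ("seas", {"underlined", "highlighted"}),
    ("will", set()),
    ("flood", {"underlined", "highlighted"}),
    ("coastal", {"underlined"}),
    ("cities", {"underlined"}),
]
```

This tiering is what makes the formatting useful as multi-granularity extractive supervision: the underlined tier approximates a full extract, while the highlighted tier approximates the terse version read aloud in a round.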
Experimental Setup and Fine-Tuning Methods
To evaluate the efficacy of OpenDebateEvidence, the authors conduct extensive fine-tuning experiments using three recent techniques: LoRA (parameter-efficient adaptation via low-rank matrices), ReFT (Representation Fine-Tuning, which refines hidden representations), and Orthogonalization (a conservative approach to parameter adjustment). Experiments are performed on three datasets: OpenDebateEvidence, DebateSum, and BillSum (US legislation summaries), using LLaMA3-8B, LLaMA3-70B, and Mistral-7B models on a 4xA100 machine. Two evaluation approaches are employed: traditional NLP evaluation metrics (ROUGE F1 scores and perplexity on 10,000 sampled documents) and LLM-as-Judge evaluation (using GPT-4o to rate the output quality and support quality of 1,000 generated abstracts on a 1-10 scale).
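For intuition about the ROUGE F1 scores reported below, here is a minimal unigram (ROUGE-1) F1 re-implementation. It is a pedagogical sketch; the paper's exact scoring package and preprocessing are not specified here, so details like lowercasing and whitespace tokenization are assumptions:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a generated summary and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 replaces unigrams with bigrams, and ROUGE-L scores the longest common subsequence; all three are reported as F1 in the tables that follow.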
Results and Performance Analysis
For OpenDebateEvidence, the LLaMA3-70B models significantly outperformed smaller models across all ROUGE metrics. The LoRA fine-tuned LLaMA3-70B achieved the highest scores (R-1: 37.2±1.0, R-2: 15.8±0.8, R-L: 33.4±1.1) with the lowest perplexity (28.9±2.5) and highest LLM-as-Judge scores (Output Quality: 8.2±0.2, Support Quality: 8.1±0.2). LoRA consistently proved the most effective fine-tuning technique across all model sizes. ReFT also showed strong improvements, while Orthogonalization was less impactful. The base versions of Google Gemini and Anthropic Claude demonstrated competitive performance out-of-the-box, outperforming base LLaMA3-8B and Mistral-7B, but were still outperformed by the fine-tuned LLaMA3-70B models. On BillSum, similar trends held, with LoRA fine-tuned LLaMA3-70B achieving the best results (R-1: 54.6±1.0, R-2: 30.5±0.8, R-L: 50.0±1.1). On DebateSum, the fine-tuned models again showed significant gains, confirming that training on OpenDebateEvidence transfers effectively to related argumentative summarization tasks.
Conclusion
OpenDebateEvidence is a large-scale dataset for argument mining and summarization, comprising 3.5 million documents from the OpenCaseList project. After extensive preprocessing and deduplication, the dataset provides a high-quality resource enriched with metadata that captures the hierarchical structure and semantics of debate arguments. The experiments demonstrate that fine-tuning modern LLMs, especially with LoRA on larger models like LLaMA3-70B, yields significant and consistent performance improvements across OpenDebateEvidence, DebateSum, and BillSum. By providing this resource to the community, the authors aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. Future work includes expanding to more diverse debate formats, integrating multimodal data, and exploring cross-linguistic adaptations.