Generalizability of Argument Identification in Context 2026
Synopsis
- Task: Decide if a sentence, in its context, constitutes an argument or not.
- Communication: mailing lists for participants and organizers
Important Dates
See the CLEF 2026 homepage.
Subscribe to the Touché mailing list to receive notifications.
Task
Given a sentence from a dataset along with metadata about its provenance, such as the source text and the dataset's annotation guidelines, predict whether the sentence can be annotated as an argument or not. In particular, participants are encouraged to develop robust systems that generalize beyond lexical shortcuts to unseen datasets and investigate ways to exploit rich context information for this purpose.
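To make the expected input–output behavior concrete, the sketch below trains a purely lexical baseline; this is exactly the kind of shortcut-prone model the task aims to move beyond, so it serves as a point of reference rather than a recommendation. The file names and JSON fields follow the Data and Input Format sections further down and should be treated as assumptions until the data is released.

import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def read_jsonl(path):
    # One JSON object per line, as described under "Input Format" below.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

train = read_jsonl("data/train.jsonl")  # hypothetical local paths
dev = read_jsonl("data/dev.jsonl")

# Purely lexical features: the kind of shortcut the task warns about.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
baseline.fit([r["sentence"] for r in train], [r["label"] for r in train])
dev_predictions = baseline.predict([r["sentence"] for r in dev])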
Data
Subsets from 10 established, publicly available benchmark datasets (ABSTRCT, ACQUA, AEC, AFS, ARGUMINSCI, FINARG, IAM, PE, SCIARK, USELEC), selected as most relevant for argument identification, will be used. Each subset consists of 1.7k labeled sentences, partitioned into training, development, and test splits with a 60/20/20 ratio. Additionally, a new evaluation-only dataset will be released. Overall, the data contains sentences labeled as Argument or No-Argument, according to the respective dataset annotations. Accompanying metadata includes sentence IDs, the generated splits, and (where available) context-relevant information from the original data sources, as well as annotation guidelines and the corresponding papers.
Evaluation
Systems will primarily be evaluated on the newly created, evaluation-only dataset. For further insight, evaluation results on the established test splits of the held-out benchmark data will also be reported, but not used for ranking. This setup mitigates the risk of data contamination in LLMs as well as participants' potential use of additional datasets during training. To assess generalizability, the macro F1-score will be measured for each test split, along with the overall average across all splits.
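For local sanity checks, the following sketch reproduces this scoring scheme with scikit-learn; the file names are placeholders, and grouping predictions by the dataset prefix of the sentence ID is an assumption based on the ID scheme described under Input Format.

import json
from collections import defaultdict
from sklearn.metrics import f1_score

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

gold = {r["id"]: r["label"] for r in read_jsonl("test_labels.jsonl")}  # placeholder file names
pred = {r["id"]: r["label"] for r in read_jsonl("predictions.jsonl")}

# Group gold and predicted labels by dataset prefix (the part of the ID before the split name).
per_split = defaultdict(lambda: ([], []))
for sentence_id, gold_label in gold.items():
    y_true, y_pred = per_split[sentence_id.split("-")[0]]
    y_true.append(gold_label)
    y_pred.append(pred[sentence_id])

scores = {split: f1_score(y_true, y_pred, average="macro")
          for split, (y_true, y_pred) in per_split.items()}
print(scores, sum(scores.values()) / len(scores))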
Submission
We ask participants to use TIRA for submissions. Each team can submit up to 3 approaches to the task.
Submissions for this task must be made as run submissions: the test data will be provided in the same format as the training and development data, and participants must return their predictions.
Output Format
The output of the submission must be a JSONL file. Each line must be a JSON object with the following fields:
- `id`: The ID of the sentence that was classified.
- `label`: The label assigned by your classifier (`Argument` if the sentence is an argument and `No-Argument` otherwise).
Example JSONL file:
{
"id": "SCIARK-test-21",
"label": "Argument"
}
{
"id": "USELEC-test-171",
"label": "No-Argument"
}
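A minimal sketch for producing a file in this format is shown below; `records` and `predict` stand in for your parsed input and your classifier and are placeholders.

import json

def write_predictions(records, predict, out_path="predictions.jsonl"):
    # records: parsed input lines (see "Input Format" below);
    # predict: a callable returning "Argument" or "No-Argument" for a sentence string.
    with open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            line = {"id": record["id"], "label": predict(record["sentence"])}
            out.write(json.dumps(line) + "\n")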
Input Format
The input for the submission will also be provided as a JSONL file, where each line is a JSON object with the following fields:
- `id`: A unique sentence identifier composed of a dataset prefix, the split name, and a running number (e.g., `ABSTRCT-train-1`).
- `paper`: A link to the corresponding dataset paper.
- `document`: A link to the source document from which the sentence was extracted.
- `guidelines`: A link to the annotation guidelines used to label the sentence.
- `label`: The gold label, where `Argument` indicates an argument sentence and `No-Argument` otherwise.
- `sentence`: The sentence itself.
Example JSONL file:
{
"id": "ABSTRCT-train-403",
"paper": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/ABSTRCT/paper/ABSTRCT.pdf",
"document": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/ABSTRCT/data/ABSTRCT-19.txt",
"guidelines": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/ABSTRCT/guidelines/ABSTRCT-Guidelines.pdf",
"label": "Argument",
"sentence": "Therefore, single-fraction radiotherapy should be considered as the palliative treatment of choice for cancer patients with painful bone metastases."
}
{
"id": "FINARG-dev-262",
"paper": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/FINARG/paper/FINARG.pdf",
"document": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/FINARG/data/FINARG-631.txt",
"guidelines": "-",
"label": "No-Argument",
"sentence": "I can take the first one, Brian, on the time spent metric."
}
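To illustrate how the provenance links can be turned into context, the sketch below reads the input and fetches the linked source document for one record. Whether remote fetching is available in the evaluation environment is not stated here, so treat this as an assumption; locally, the relative paths mentioned under Please Note can be used instead.

import json
import urllib.request

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = read_jsonl("train.jsonl")  # placeholder path
first = records[0]

# Fetch the source document behind the "document" field as additional context
# (assumes the field holds a resolvable URL, as in the example above).
with urllib.request.urlopen(first["document"]) as response:
    document_text = response.read().decode("utf-8")
print(first["sentence"], len(document_text))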
Please Note
- Links provided under `paper` and `guidelines` point to a PDF file.
- Links provided under `document` point to a TXT file.
- The respective `train`/`dev`/`test` splits will be published sequentially in `./data/` as separate files (e.g., `train.jsonl`).
- Each `train`/`dev`/`test` split will have a separate file containing the labels (e.g., `train_labels.jsonl`).
- The paths in each split are relative to the `./data/` working directory and point to the respective files (e.g., `./ABSTRCT/data/ABSTRCT-1.txt`).
- The data will be distributed via TIRA, but is already published in this GitHub repository.
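Putting these notes together, a small sketch for joining a split file with its label file and loading the referenced source documents could look as follows; the file and field names follow the descriptions above, and the exact layout should be verified against the released data.

import json
from pathlib import Path

DATA_DIR = Path("./data")  # working directory mentioned above

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

sentences = read_jsonl(DATA_DIR / "train.jsonl")
labels = {r["id"]: r["label"] for r in read_jsonl(DATA_DIR / "train_labels.jsonl")}

for record in sentences:
    record["label"] = labels.get(record["id"])
    document = record["document"]
    if document.startswith("http"):
        continue  # the website examples use full URLs; fetch remotely or map to the local copy
    # Relative paths (e.g., ./ABSTRCT/data/ABSTRCT-1.txt) resolve against ./data/.
    document_path = DATA_DIR / document
    if document_path.exists():
        record["context"] = document_path.read_text(encoding="utf-8")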
Related Work
- Marc Feger, Katarina Boland, and Stefan Dietze. Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, July 2025.
- Terne Sasha Thorn Jakobsen, Maria Barrett, and Anders Søgaard. Spurious Correlations in Cross-Topic Argument Mining. In Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics, August 2021.
- Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence, November 2020.
Task Committee