Generalizability of Argument Identification in Context 2026

Synopsis

  • Register for participation
  • Join the Touché mailing list

Important Dates

  • 14.01.26: The training and development data is released.
  • 24.04.26: The test data is released.
  • 22.05.26: The leaderboard is released.
  • 22.05.26: The test labels are released.

The run submission deadline has been extended to May 21st, 2026 (AoE).

See the CLEF 2026 homepage for other dates.

Subscribe to the Touché mailing list to receive notifications.

Task

Given a sentence from a dataset along with metadata about its provenance, such as the source text and the dataset's annotation guidelines, predict whether the sentence can be annotated as an argument or not. In particular, participants are encouraged to develop robust systems that generalize beyond lexical shortcuts to unseen datasets and investigate ways to exploit rich context information for this purpose.

Data

The task uses subsets of 10 established, publicly available benchmark datasets (ABSTRCT, ACQUA, AEC, AFS, ARGUMINSCI, FINARG, IAM, PE, SCIARK, USELEC) that the organizers consider most relevant for argument identification. Each subset consists of 1.7k labeled sentences, partitioned with a 60/20/20 ratio into training, development, and test splits. Additionally, a new evaluation-only dataset will be released. All sentences are labeled as Argument or No-Argument according to the respective dataset's annotations. Accompanying metadata includes sentence IDs, the generated splits, and (where available) context information from the original data sources, as well as annotation guidelines and the corresponding papers.

Evaluation

Systems will primarily be evaluated on the newly created, evaluation-only dataset. This dataset comprises 340 test sentences from the enhanced and anonymized TACO dataset, each annotated with its original label as well as labels aligned with the PE and USELEC guidelines (1,020 labeled instances in total). For further insight, evaluation results on the established test splits from the held-out benchmark data will also be provided but not used for ranking. This setup addresses the risk of data contamination in LLMs and accounts for participants' potential use of additional datasets during training. To evaluate how well systems generalize, the macro F1-score will be measured for each test split, along with the overall average of these values.
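
The per-split scoring described above can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function names are hypothetical, the dataset is assumed to be the ID prefix before the first hyphen, and a class with no gold and no predicted instances is assumed to contribute an F1 of 0.

```python
from collections import defaultdict


def macro_f1(gold, pred, labels=("Argument", "No-Argument")):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Assumption: an empty class counts as F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def scores_by_dataset(gold_by_id, pred_by_id):
    """Group sentences by ID prefix and score each group separately."""
    groups = defaultdict(lambda: ([], []))
    for sentence_id, gold_label in gold_by_id.items():
        dataset = sentence_id.split("-")[0]  # assumed: prefix names the dataset
        gold, pred = groups[dataset]
        gold.append(gold_label)
        pred.append(pred_by_id[sentence_id])  # KeyError if a prediction is missing
    per_dataset = {name: macro_f1(g, p) for name, (g, p) in groups.items()}
    average = sum(per_dataset.values()) / len(per_dataset)
    return per_dataset, average
```

Here `scores_by_dataset` averages over every dataset present in the gold labels; restricting the gold dictionary to particular test sets gives the average for that subset only.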

Submission

We ask participants to use TIRA for submissions. Each team can submit up to 3 approaches to the task.

The submissions for this task must be made as a run submission, meaning the test data will be provided in the same format as the training and development data, and participants must return their predictions.

Output Format

The output of the submission must be a JSONL file. Each line in the file must be a JSON object with the following fields:

  • id: The ID of the sentence that was classified.
  • label: The label assigned by your classifier (Argument if the sentence is an argument and No-Argument otherwise).

Example JSONL file:

    {"id": "SCIARK-test-21", "label": "Argument"}
    {"id": "USELEC-test-171", "label": "No-Argument"}
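
A prediction file in this format could be produced along the following lines. This is a sketch under assumptions: the helper name `write_predictions` and the file name `predictions.jsonl` are illustrative and not prescribed by the task.

```python
import json

VALID_LABELS = {"Argument", "No-Argument"}


def write_predictions(predictions, path):
    """Write (id, label) pairs as one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence_id, label in predictions:
            if label not in VALID_LABELS:
                raise ValueError(f"unexpected label {label!r} for {sentence_id}")
            record = {"id": sentence_id, "label": label}
            # json.dumps emits no line breaks by default, so each record stays on one line.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_predictions(
        [("SCIARK-test-21", "Argument"), ("USELEC-test-171", "No-Argument")],
        "predictions.jsonl",  # hypothetical file name
    )
```

Validating the label before writing catches misspelled labels such as "argument" before the file is submitted.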

Input Format

The input for the submission will also be provided as JSONL files (e.g., train.jsonl and train_labels.jsonl), which can be combined to follow the JSON structure below:

  • id: A unique sentence identifier composed of a dataset prefix, the split name, and a running number (e.g., ABSTRCT-train-1).
  • paper: A link to the corresponding dataset paper.
  • document: A link to the source document from which the sentence was extracted.
  • guidelines: A link to the annotation guidelines used to label the sentence.
  • label: The gold label, where Argument indicates an argument sentence and No-Argument otherwise.
  • sentence: The sentence itself.

Example JSONL file:

    {
        "id": "ABSTRCT-train-403",
        "paper": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/ABSTRCT/paper/ABSTRCT.pdf",
        "document": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/ABSTRCT/data/ABSTRCT-19.txt",
        "guidelines": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/ABSTRCT/guidelines/ABSTRCT-Guidelines.pdf",
        "label": "Argument",
        "sentence": "Therefore, single-fraction radiotherapy should be considered as the palliative treatment of choice for cancer patients with painful bone metastases."
    }
    {
        "id": "FINARG-dev-262",
        "paper": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/FINARG/paper/FINARG.pdf",
        "document": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/FINARG/data/FINARG-631.txt",
        "guidelines": "-",
        "label": "No-Argument",
        "sentence": "I can take the first one, Brian, on the time spent metric."
    }
    {
        "id": "TAUS-test-90",
        "paper": "./TACO/paper/TACO.pdf",
        "document": "./TACO/data/TACO-19.txt",
        "guidelines": "./TAUS/guidelines/TAUS-Guidelines.pdf",
        "label": "No-Argument",
        "sentence": "@Mrs._Diane_Reyes Theon killed was the cherry on top."
    }
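
Combining a split file with its label file might look like the sketch below. It assumes that both files store one JSON object per line and that the label file carries `id` and `label` fields to join on; the function names are illustrative.

```python
import json


def read_jsonl(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def merge_split(sentences_path, labels_path):
    """Attach each gold label to its sentence record via the shared id."""
    # Assumption: every row of the label file has "id" and "label" keys.
    labels = {row["id"]: row["label"] for row in read_jsonl(labels_path)}
    merged = []
    for row in read_jsonl(sentences_path):
        # Sentences without a matching label get None instead of raising.
        merged.append({**row, "label": labels.get(row["id"])})
    return merged
```

For example, `merge_split("train.jsonl", "train_labels.jsonl")` would return records shaped like the structure listed above, provided those assumptions hold.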

Please Note

  • Links provided under paper and guidelines point to a PDF file.
  • Links provided under document point to a TXT file.
  • The respective train/dev/test splits will be published sequentially in ./data/ as separate files (e.g., train.jsonl).
  • Each train/dev/test split will have a separate file containing the labels (e.g., train_labels.jsonl).
  • The paths in each split are relative to the ./data/ working directory and point to the respective files (e.g., ./ABSTRCT/data/ABSTRCT-1.txt).
  • The three TACO test variants (TACO, TACO-PE, TACO-USELEC) will accompany the held-out data and use standard IDs (e.g., taco-test-1, tape-test-1, taus-test-1).
  • For TACO-PE and TACO-USELEC, the original guidelines were applied, and any deviations are documented in the respective guideline files (e.g., ./TAPE/guidelines/TAPE-Guidelines.pdf).
  • The data will be distributed via TIRA, but is already published in this GitHub repository.
  • If you encounter any issues, please feel free to contact any of the maintainers.
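
The ID and path conventions in these notes can be handled with two small helpers, sketched below. The helper names are hypothetical, `./data` is taken from the notes as the working directory, and treating "-" as "no file available" is an assumption based on the FINARG example above.

```python
from pathlib import Path


def parse_id(sentence_id):
    """Split an ID such as 'ABSTRCT-train-403' into (dataset, split, number)."""
    # rsplit keeps any hyphen inside the dataset prefix intact.
    dataset, split, number = sentence_id.rsplit("-", 2)
    return dataset, split, int(number)


def resolve_reference(value, data_dir="./data"):
    """Return a URL unchanged, a local Path for relative values, or None for '-'."""
    if not value or value == "-":
        return None  # assumption: '-' marks a missing file
    if value.startswith(("http://", "https://")):
        return value
    return Path(data_dir) / value


if __name__ == "__main__":
    print(parse_id("ABSTRCT-train-403"))
    print(resolve_reference("./ABSTRCT/data/ABSTRCT-1.txt"))
```

Returning URLs as strings and local files as `Path` objects lets the caller decide whether to download or open the target.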

Related Work

Leaderboard

Condensed Leaderboard

The condensed leaderboard is shown below, as described in Evaluation. Main is the average of the macro F1-scores on the three evaluation-only test sets (TACO, TAPE, TAUS) and determines the official ranking, while Supplementary is the average across the test splits of the 10 publicly available benchmarks.

team             id                   name               Main    Supplementary
context-awakens  2026-05-08-15-04-50  Full GAIC Testset  0.7955  0.7308
arginvariant     2026-05-06-16-07-49  arginvariant_1     0.7834  0.7899
arginvariant     2026-05-06-16-08-30  arinvariant_2      0.7829  0.7533
the-wildcards    2026-05-18-16-29-20  hybrid             0.7822  0.8128
the-wildcards    2026-05-21-18-35-58  local              0.7693  0.8057
arginvariant     2026-05-06-16-09-54  arginvariant_3     0.7561  0.6531
code-doctors     2026-04-29-21-22-47  run_1              0.6502  0.8000
the-wildcards    2026-05-21-18-34-10  solo               0.5938  0.8148

Detailed Leaderboard

The detailed leaderboard used to compute the official scores in the previous section is shown below.

team             id                   name               ABSTRCT  ACQUA   AEC     AFS     ARGUMINSCI  FINARG  IAM     PE      SCIARK  TACO    TAPE    TAUS    USELEC
context-awakens  2026-05-08-15-04-50  Full GAIC Testset  0.8735   0.7882  0.9141  0.7119  0.7933      0.4764  0.7146  0.6646  0.6682  0.8408  0.7604  0.7853  0.7029
arginvariant     2026-05-06-16-07-49  arginvariant_1     0.8529   0.8529  0.9588  0.7475  0.8032      0.6705  0.7041  0.8137  0.7465  0.8133  0.7757  0.7612  0.7490
arginvariant     2026-05-06-16-08-30  arinvariant_2      0.8466   0.7968  0.9381  0.6441  0.7505      0.6372  0.6795  0.7881  0.7203  0.8133  0.7757  0.7598  0.7316
the-wildcards    2026-05-18-16-29-20  hybrid             0.9000   0.8617  0.9588  0.8265  0.8146      0.6728  0.7529  0.7706  0.8234  0.8265  0.7465  0.7735  0.7468
the-wildcards    2026-05-21-18-35-58  local              0.9000   0.7910  0.9588  0.8265  0.8146      0.6728  0.7529  0.7706  0.8234  0.7912  0.7506  0.7660  0.7468
arginvariant     2026-05-06-16-09-54  arginvariant_3     0.7924   0.7455  0.8076  0.5998  0.6132      0.5598  0.7041  0.4696  0.6760  0.8101  0.7204  0.7377  0.5632
code-doctors     2026-04-29-21-22-47  run_1              0.8587   0.8205  0.9559  0.7941  0.8265      0.6715  0.7382  0.7676  0.8056  0.6025  0.6748  0.6734  0.7617
the-wildcards    2026-05-21-18-34-10  solo               0.9000   0.8258  0.9588  0.8265  0.8382      0.6728  0.7529  0.7880  0.8234  0.5098  0.6216  0.6499  0.7617
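
As a worked check, the condensed scores can be recomputed from a detailed row. The sketch below uses the context-awakens row copied from the table above and assumes that Main averages the TACO, TAPE, and TAUS columns while Supplementary averages the remaining ten; the function name is illustrative.

```python
from statistics import mean

EVALUATION_ONLY = ("TACO", "TAPE", "TAUS")

# Scores of the context-awakens run, copied from the detailed leaderboard.
detailed = {
    "ABSTRCT": 0.8735, "ACQUA": 0.7882, "AEC": 0.9141, "AFS": 0.7119,
    "ARGUMINSCI": 0.7933, "FINARG": 0.4764, "IAM": 0.7146, "PE": 0.6646,
    "SCIARK": 0.6682, "TACO": 0.8408, "TAPE": 0.7604, "TAUS": 0.7853,
    "USELEC": 0.7029,
}


def condensed_scores(scores):
    """Average the evaluation-only columns (Main) and the rest (Supplementary)."""
    main = mean(scores[name] for name in EVALUATION_ONLY)
    supplementary = mean(
        value for name, value in scores.items() if name not in EVALUATION_ONLY
    )
    return round(main, 4), round(supplementary, 4)


print(condensed_scores(detailed))  # (0.7955, 0.7308)
```

The result matches the Main and Supplementary values listed for this run in the condensed leaderboard.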

Task Committee