Generalizability of Argument Identification in Context 2026

Synopsis

  • Task: Decide whether a sentence, in its context, constitutes an argument.
  • Communication: [mailing lists: participants, organizers]

Important Dates

See the CLEF 2026 homepage.

Subscribe to the Touché mailing list to receive notifications.

Task

Given a sentence from a dataset along with metadata about its provenance, such as the source text and the dataset's annotation guidelines, predict whether the sentence can be annotated as an argument or not. In particular, participants are encouraged to develop robust systems that generalize beyond lexical shortcuts to unseen datasets and investigate ways to exploit rich context information for this purpose.

Data

The task uses subsets of 10 established, publicly available benchmark datasets (ABSTRCT, ACQUA, AEC, AFS, ARGUMINSCI, FINARG, IAM, PE, SCIARK, USELEC) that are considered most relevant for argument identification. Each subset consists of 1.7k labeled sentences, partitioned into training, development, and test splits at a 60/20/20 ratio. Additionally, a new evaluation-only dataset will be released. All sentences are labeled as Argument or No-Argument according to the respective dataset's annotations. Accompanying metadata includes sentence IDs, the generated splits, and (where available) context information from the original data sources, as well as the annotation guidelines and corresponding papers.

Evaluation

Systems will primarily be evaluated on the newly created, evaluation-only dataset. For further insights, evaluation results on the held-out test splits of the established benchmark datasets will also be provided but not used for ranking. This setup addresses the risk of data contamination in LLMs and accounts for participants' potential use of additional datasets during training. To assess how well systems generalize, the macro F$_1$ score will be measured for each test split, along with the average of these scores over all test splits.
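
The scoring described above can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function names are hypothetical, the dataset is assumed to be the prefix of the sentence ID, a missing prediction is assumed to count as an error, and a class with no gold or predicted instances is assumed to contribute an F$_1$ of 0.

```python
from collections import defaultdict

LABELS = ("Argument", "No-Argument")


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Assumption: a class that never occurs scores 0 instead of raising.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate(gold_by_id, pred_by_id):
    """Macro F1 per dataset, plus the unweighted mean over all datasets."""
    grouped = defaultdict(lambda: ([], []))
    for sentence_id, gold_label in gold_by_id.items():
        # IDs look like "SCIARK-test-21": dataset prefix, split, running number.
        dataset = sentence_id.rsplit("-", 2)[0]
        gold, pred = grouped[dataset]
        gold.append(gold_label)
        # Assumption: a missing prediction (None) never matches a gold label.
        pred.append(pred_by_id.get(sentence_id))
    per_dataset = {name: macro_f1(g, p) for name, (g, p) in grouped.items()}
    average = sum(per_dataset.values()) / len(per_dataset)
    return per_dataset, average
```

Averaging the per-dataset scores without weighting gives every test split the same influence, regardless of its size.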

Submission

We ask participants to use TIRA for submissions. Each team can submit up to 3 approaches to the task.

The submissions for this task must be made as a run submission, meaning the test data will be provided in the same format as the training and development data, and participants must return their predictions.

Output Format

The output of the submission needs to be a JSONL file. Each line in the file must be in the following JSON format:

  • id: The ID of the sentence that was classified.
  • label: The label assigned by your classifier (Argument if the sentence is an argument and No-Argument otherwise).
Example JSONL file:

    {"id": "SCIARK-test-21", "label": "Argument"}
    {"id": "USELEC-test-171", "label": "No-Argument"}
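
A run file in this format could be produced along the following lines. This is a sketch under assumptions: the function name `write_predictions` and the output path are hypothetical, and the label check simply mirrors the two labels named above.

```python
import json

VALID_LABELS = {"Argument", "No-Argument"}


def write_predictions(predictions, path):
    """Write (id, label) pairs as JSONL: one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence_id, label in predictions:
            if label not in VALID_LABELS:
                raise ValueError(f"unexpected label {label!r} for {sentence_id}")
            record = {"id": sentence_id, "label": label}
            # json.dumps emits no newlines by default, so each record stays on one line.
            handle.write(json.dumps(record) + "\n")
```

For example, `write_predictions([("SCIARK-test-21", "Argument")], "predictions.jsonl")` writes a single-line file containing that one record.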

Input Format

The input for the submission will also be provided as a JSONL file, where each line follows the JSON structure below:

  • id: A unique sentence identifier composed of a dataset prefix, the split name, and a running number (e.g., ABSTRCT-train-1).
  • paper: A link to the corresponding dataset paper.
  • document: A link to the source document from which the sentence was extracted.
  • guidelines: A link to the annotation guidelines used to label the sentence.
  • label: The gold label, where Argument indicates an argument sentence and No-Argument otherwise.
  • sentence: The sentence itself.
Example JSONL records (pretty-printed here for readability; in the file, each record occupies a single line):

    {
        "id": "ABSTRCT-train-403",
        "paper": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/ABSTRCT/paper/ABSTRCT.pdf",
        "document": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/ABSTRCT/data/ABSTRCT-19.txt",
        "guidelines": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/ABSTRCT/guidelines/ABSTRCT-Guidelines.pdf",
        "label": "Argument",
        "sentence": "Therefore, single-fraction radiotherapy should be considered as the palliative treatment of choice for cancer patients with painful bone metastases."
    }
    {
        "id": "FINARG-dev-262",
        "paper": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/FINARG/paper/FINARG.pdf",
        "document": "https://raw.githubusercontent.com/TomatenMarc/GAIC-2026/refs/heads/main/data/FINARG/data/FINARG-631.txt",
        "guidelines": "-",
        "label": "No-Argument",
        "sentence": "I can take the first one, Brian, on the time spent metric."
    }
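
Reading such a file and taking the sentence ID apart might look like the sketch below. The helper names are hypothetical. Treating a guidelines value of "-" as "not available" is an assumption based on the FINARG record above, not a documented convention.

```python
import json


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def parse_sentence_id(sentence_id):
    """Split e.g. 'ABSTRCT-train-403' into ('ABSTRCT', 'train', 403)."""
    # Split from the right so only the last two hyphens act as separators.
    dataset, split, number = sentence_id.rsplit("-", 2)
    return dataset, split, int(number)


def has_guidelines(record):
    """Assumption: '-' marks a record without annotation guidelines."""
    return record.get("guidelines", "-") != "-"
```

With these helpers, `parse_sentence_id("FINARG-dev-262")` yields the dataset prefix, the split name, and the running number as separate values, which is convenient for grouping records by source dataset.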

Please Note

  • Links provided under paper and guidelines point to PDF files.
  • Links provided under document point to TXT files.
  • The respective train/dev/test splits will be published sequentially in ./data/ as separate files (e.g., train.jsonl).
  • Each train/dev/test split will have a separate file containing the labels (e.g., train_labels.jsonl).
  • The paths in each split are relative to the ./data/ working directory and point to the respective files (e.g., ./ABSTRCT/data/ABSTRCT-1.txt).
  • The data will be distributed via TIRA, but is already published in this GitHub repository.
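
The notes above can be combined into a small loading sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, the label files are assumed to hold records with id and label fields, and a field value may be either a full URL (as in the example records) or a path relative to ./data/.

```python
from pathlib import Path


def resolve_resource(value, data_dir="./data"):
    """Return a URL unchanged, a local Path for relative values, or None for '-'."""
    if value == "-":
        return None  # Assumption: '-' means the resource is not available.
    if value.startswith(("http://", "https://")):
        return value
    return Path(data_dir) / value


def attach_labels(records, label_records):
    """Merge labels into split records by sentence ID (assumed label file layout)."""
    labels = {row["id"]: row["label"] for row in label_records}
    # Records without a matching label receive None rather than raising.
    return [{**record, "label": labels.get(record["id"])} for record in records]
```

Keeping path resolution in one helper means the rest of a pipeline does not need to know whether a resource is local or remote.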

Related Work

Task Committee