Retrieval-Augmented Debating 2025

Synopsis

  • Sub-Task 1: Generate responses to argue against a simulated debate partner.
  • Sub-Task 2: Evaluate systems of sub-task 1.
  • Communication: mailing lists for participants and organizers
Register for participation | Join the Touché mailing list

Important Dates

Subscribe to the Touché mailing list to receive notifications.

  • Nov. 2024: CLEF Registration opened [register]
  • April-May 2025: Approaches submission deadline.
  • May 2025: Participant paper submission.
  • June 2025: Peer review notification.
  • July 2025: Camera-ready participant papers submission.
  • Sep. 2025: CLEF Conference in Madrid and Touché Workshop.

All deadlines are 23:59 CEST (UTC+2).

Task

This task aims to develop generative retrieval systems that argue against their users, either to support users in forming or confirming opinions or to train their debating skills. Participating systems debate with simulated users over multiple turns (following the procedure shown below) and are evaluated based on their responses.

U1: Claim statement
S1: Supposed to attack U1
U2: Attacks S1
S2: Supposed to respond to U2
U3: Attacks S1 or S2
S3: Supposed to respond to U3
U4: Attacks S1 or S2 or S3
S4: Supposed to respond to U4
Debate procedure for sub-task 1. The simulated user always starts by stating a claim and later attacks the system's responses. The system is expected to respond, either by counterattacking or defending.
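The turn-taking rule above (user opens with a claim, then user and system strictly alternate, with each user attack free to target any earlier system response) can be sketched in a few lines. This is an illustrative sketch only; the `Turn` class and `next_speaker` function are hypothetical names, not part of the task's API.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "user" or "system"
    text: str

def next_speaker(turns: list[Turn]) -> str:
    """The user always opens (with a claim); afterwards turns strictly alternate."""
    if not turns:
        return "user"
    return "system" if turns[-1].speaker == "user" else "user"

# Example: after the opening claim U1, the system must produce S1.
debate = [Turn("user", "Claim: ...")]
assert next_speaker(debate) == "system"
```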

Sub-Task 1: Participants submit systems that respond with an utterance Si to a (simulated) user's utterance Ui by (1) retrieving either counterarguments (to Ui) or supporting evidence (for the attacked system utterance) from a provided argument collection and (2) generating a response (of at most 60 words) from the retrieved data.
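A minimal retrieve-then-generate pipeline for sub-task 1 might look like the sketch below. Everything here is an assumption for illustration: retrieval is stubbed as naive word-overlap ranking over the collection, and "generation" simply returns the retrieved text truncated to the task's 60-word response limit; a real submission would plug in proper retrieval and generation models.

```python
def truncate_to_words(text: str, max_words: int = 60) -> str:
    """Enforce the task's cap of at most 60 words per system response."""
    return " ".join(text.split()[:max_words])

def respond(user_utterance: str, argument_collection: list[str]) -> str:
    # (1) Retrieve a counterargument or supporting evidence from the provided
    #     collection (stub: rank by word overlap with the user utterance).
    query_terms = set(user_utterance.lower().split())
    retrieved = max(
        argument_collection,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
    )
    # (2) Generate a response of at most 60 words from the retrieved data
    #     (stub: return the retrieved argument itself, truncated).
    return truncate_to_words(retrieved)
```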

Sub-Task 2: Participants submit systems that assess the responses of the sub-task 1 systems according to one or more of these criteria: (1) relevance of the retrieved counterarguments or evidence; (2) faithfulness of the generated response to the retrieved data; (3) coherence of the system's responses across the conversation; and (4) argumentative quality of the response.
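Since a sub-task 2 system may assess any subset of the four criteria, its output can be thought of as a partial score map. The aggregation below (a plain mean over the assessed criteria) is purely an assumption for illustration; the task description does not prescribe how, or whether, criterion scores are combined.

```python
# The four assessment criteria named in the task description.
CRITERIA = ("relevance", "faithfulness", "coherence", "quality")

def aggregate(scores: dict[str, float]) -> float:
    """Average over whichever of the four criteria a system chose to assess."""
    assessed = [scores[c] for c in CRITERIA if c in scores]
    return sum(assessed) / len(assessed)
```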

We will release more information on the data, evaluation procedure, and baselines in the coming months. Join the Touché mailing list to stay up to date.

Task Committee