Argument Retrieval for Controversial Questions 2023


  • Task: Given a controversial topic, the task is to retrieve and rank documents by their relevance and argument quality and to detect the document stance.
  • Input: [topics].
  • Submission: [submit].


The goal of Task 1 is to provide an overview of arguments and opinions on controversial topics. Given a controversial topic and a collection of web documents, the task is to retrieve and rank documents by relevance to the topic and by argument quality, and to detect the document stance. For a given set of 50 search topics, participants of Task 1 will retrieve and rank documents from the ClueWeb22 crawl that contain relevant, high-quality arguments.



Example topic for Task 1 (download topics):

<title>Should teachers get tenure?</title>
<description>A user has heard that some countries do give teachers tenure and others don't. 
Interested in the reasoning for or against tenure, the user searches for positive and negative arguments [...]</description>
<narrative>Highly relevant arguments make a clear statement about tenure for teachers in schools or universities [...]</narrative>
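As a rough illustration of how such topics could be loaded for retrieval, the sketch below parses a small in-memory sample with Python's standard library. The surrounding `<topics>`, `<topic>`, and `<number>` elements are assumptions about the file layout; only `<title>`, `<description>`, and `<narrative>` appear in the example above, and the sample text is shortened.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content: the <topics>/<topic>/<number> wrappers are assumed,
# and the description and narrative are shortened for illustration.
SAMPLE = """<topics>
  <topic>
    <number>1</number>
    <title>Should teachers get tenure?</title>
    <description>A user searches for positive and negative arguments.</description>
    <narrative>Highly relevant arguments make a clear statement about tenure.</narrative>
  </topic>
</topics>"""


def parse_topics(xml_text):
    """Return one dict per topic with number, title, description, and narrative."""
    root = ET.fromstring(xml_text)
    fields = ("number", "title", "description", "narrative")
    return [
        {field: (node.findtext(field) or "").strip() for field in fields}
        for node in root.iter("topic")
    ]


topics = parse_topics(SAMPLE)
print(topics[0]["title"])
```

The title would typically serve as the query of an automatic run, while the description and narrative document the information need.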

The corpus for Task 1 is ClueWeb22 category B. You may index the ClueWeb22 with your favorite retrieval system. To ease participation, you may also directly use the ChatNoir search engine's Python API or PyTerrier wrapper for a baseline retrieval. You will receive credentials to access the ChatNoir API upon a completed registration in TIRA. Please obtain a license from CMU. (The $0 license is sufficient to work with ChatNoir.)

Additional resources:


Our human assessors will label the ranked results both for their general topical relevance and for the rhetorical argument quality [paper], i.e., "well-writtenness": (1) whether the document contains arguments and whether the argument text has a good style of speech, (2) whether the text has a proper sentence structure and is easy to follow, (3) whether it includes profanity, has typos, etc. Optionally, detect the documents' stance: pro, con, neutral, or no stance towards the search topic. We will use nDCG@10 to evaluate rankings and a macro-averaged F1 to evaluate stance detection.
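For intuition about the two measures, the following is a minimal sketch and not the official evaluation code. It assumes graded labels with linear gain and a log2 rank discount for nDCG@10, treats unjudged documents as label 0, and averages per-class F1 over the four stance labels; the official evaluation may differ in these details.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """judgments maps doc ID to a graded label; unjudged documents count as 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def macro_f1(gold, predicted, labels=("PRO", "CON", "NEU", "NO")):
    """Unweighted mean of per-class F1; a class with no instances scores 0."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because macro-averaging weights every stance class equally, a system that predicts only the majority class obtains a low macro F1 even if its accuracy is high.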


We ask participants to use TIRA for result submissions.

Runs may be either automatic or manual. An automatic run must not "manipulate" the topic titles via manual intervention. A manual run is anything that is not an automatic run. Upon submission, please let us know which of your runs are manual. For each topic, include up to 1,000 retrieved documents. Each team can submit up to 5 different runs.

The submission format for the task will follow the standard TREC format:

qid stance doc rank score tag


  • qid: The topic number.
  • stance: The stance of the document (PRO: supports the topic, CON: against the topic, NEU: neutral stance, NO: no stance).
  • doc: The ID of the document retrieved for the topic qid.
  • rank: The rank the document is retrieved at.
  • score: The score (integer or floating point) that generated the ranking. The score must be in descending (non-increasing) order. It is important to handle tied scores.
  • tag: A tag that identifies your group and the method you used to produce the run.

If you do not classify the stance, use Q0 as the value in the stance column. The fields must be separated by whitespace. The widths of the individual columns are not restricted (e.g., the score may have arbitrary precision so that no ties occur), but every line must include all six columns.
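One way to produce lines in this format is sketched below. The function name `format_run` and its tuple layout are hypothetical; breaking ties by document ID is only one possible deterministic choice, and the limit of 1,000 documents per topic follows the run rules stated above.

```python
def format_run(qid, scored_docs, tag, max_docs=1000):
    """Format (doc_id, score, stance) tuples as run lines for one topic.

    A stance of None is written as Q0 (no stance classification).
    """
    # Sort by descending score; ties are broken by document ID so that the
    # resulting ranking is reproducible.
    ordered = sorted(scored_docs, key=lambda d: (-d[1], d[0]))[:max_docs]
    return [
        f"{qid} {stance or 'Q0'} {doc_id} {rank} {score} {tag}"
        for rank, (doc_id, score, stance) in enumerate(ordered, start=1)
    ]


lines = format_run(
    1,
    [
        ("clueweb22-en0010-05-00457", 16.43, "CON"),
        ("clueweb22-en0004-03-29836", 17.89, "PRO"),
    ],
    "myGroupMyMethod",
)
print("\n".join(lines))
```

Assigning ranks only after sorting keeps rank and score consistent, which is the main point of handling tied scores explicitly.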

An example run for Task 1 is:

1 PRO clueweb22-en0004-03-29836 1 17.89 myGroupMyMethod
1 CON clueweb22-en0010-05-00457 2 16.43 myGroupMyMethod
1 NEU clueweb22-en0000-00-00002 3 16.32 myGroupMyMethod
1 NO clueweb22-en0070-00-00123 4 15.22 myGroupMyMethod
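Before submitting, a run file can be checked against the rules above. The following sketch is a hypothetical helper and not an official validator; it assumes that the lines of each topic appear in rank order and checks the column count, the stance values, the rank and score order, and the limit of 1,000 documents per topic.

```python
VALID_STANCES = {"PRO", "CON", "NEU", "NO", "Q0"}


def validate_run(lines, max_docs_per_topic=1000):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    last = {}    # qid -> (rank, score) of the previous line for that topic
    counts = {}  # qid -> number of lines seen for that topic
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            problems.append(f"line {n}: expected 6 fields, got {len(fields)}")
            continue
        qid, stance, _doc, rank, score, _tag = fields
        if stance not in VALID_STANCES:
            problems.append(f"line {n}: unknown stance {stance!r}")
        try:
            rank, score = int(rank), float(score)
        except ValueError:
            problems.append(f"line {n}: rank must be an integer, score a number")
            continue
        counts[qid] = counts.get(qid, 0) + 1
        if counts[qid] == max_docs_per_topic + 1:
            problems.append(f"line {n}: too many documents for topic {qid}")
        if qid in last:
            prev_rank, prev_score = last[qid]
            if rank <= prev_rank:
                problems.append(f"line {n}: rank does not increase for topic {qid}")
            if score > prev_score:
                problems.append(f"line {n}: score increases for topic {qid}")
        last[qid] = (rank, score)
    return problems


example = [
    "1 PRO clueweb22-en0004-03-29836 1 17.89 myGroupMyMethod",
    "1 CON clueweb22-en0010-05-00457 2 16.43 myGroupMyMethod",
    "1 NEU clueweb22-en0000-00-00002 3 16.32 myGroupMyMethod",
    "1 NO clueweb22-en0070-00-00123 4 15.22 myGroupMyMethod",
]
print(validate_run(example))
```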

TIRA Tutorial and TIRA Baselines

We provide a TIRA tutorial with baselines that can be executed in TIRA.


Relevance results of all submitted runs.

Team           Tag                                    Mean nDCG@10  CI95 Low  CI95 High
Puss in Boots  puss-in-boots_baseline                 0.834         0.791     0.875
Renji Abarai   renji_abarai_stance_ChatGPT            0.747         0.687     0.812
Renji Abarai   renji_abarai_stance-certainNO_ChatGPT  0.746         0.678     0.810
Renji Abarai   renji_abarai_ChatGPT_mmGhl             0.718         0.653     0.775
Renji Abarai   renji_abarai_ChatGPT_mmEQhl            0.718         0.650     0.779
Renji Abarai   renji_abarai_meta_qual_score           0.712         0.641     0.782
Renji Abarai   renji_abarai_baseline                  0.708         0.632     0.775
Renji Abarai   renji_abarai_meta_qual_prob            0.697         0.622     0.765

Quality results of all submitted runs.

Team           Tag                                    Mean nDCG@10  CI95 Low  CI95 High
Puss in Boots  puss-in-boots_baseline                 0.831         0.786     0.873
Renji Abarai   renji_abarai_stance_ChatGPT            0.815         0.764     0.862
Renji Abarai   renji_abarai_stance-certainNO_ChatGPT  0.811         0.754     0.863
Renji Abarai   renji_abarai_ChatGPT_mmEQhl            0.789         0.730     0.846
Renji Abarai   renji_abarai_ChatGPT_mmGhl             0.789         0.731     0.842
Renji Abarai   renji_abarai_meta_qual_prob            0.774         0.712     0.830
Renji Abarai   renji_abarai_meta_qual_score           0.771         0.710     0.832
Renji Abarai   renji_abarai_baseline                  0.766         0.698     0.823

Stance results of all submitted runs. Since Renji Abarai re-ranked the same set of documents for all the runs, this yields identical stance detection results.

Team           Tag                                    F1 macro (run)  N (run)  F1 macro (team)  N (team)
Renji Abarai   renji_abarai_ChatGPT_mmEQhl            0.599           500      0.599            500
Renji Abarai   renji_abarai_ChatGPT_mmGhl             0.599           500      0.599            500
Renji Abarai   renji_abarai_baseline                  0.599           500      0.599            500
Renji Abarai   renji_abarai_meta_qual_prob            0.599           500      0.599            500
Renji Abarai   renji_abarai_meta_qual_score           0.599           500      0.599            500
Renji Abarai   renji_abarai_stance-certainNO_ChatGPT  0.599           500      0.599            500
Renji Abarai   renji_abarai_stance_ChatGPT            0.599           500      0.599            500
Puss in Boots  puss-in-boots_baseline                 0.203           500      0.203            500

Task Committee