Argument Retrieval for Controversial Questions 2023

Synopsis

  • Task: Given a controversial topic, retrieve and rank documents by their relevance to the topic and by their argument quality, and detect each document's stance.
  • Input: the 50 search topics (see Data below).
  • Submission: runs in TREC format via TIRA (see Submission below).

Task

The goal of Task 1 is to provide an overview of arguments and opinions on controversial topics. Given a controversial topic and a collection of web documents, the task is to retrieve and rank documents by relevance to the topic and by argument quality, and to detect each document's stance. Participants retrieve and rank documents from the ClueWeb22 crawl that contain relevant, high-quality arguments for a given set of 50 search topics.


Data

Example topic for Task 1 (download topics):

<topic>
<number>1</number>
<title>Should teachers get tenure?</title>
<description>A user has heard that some countries do give teachers tenure and others don't. 
Interested in the reasoning for or against tenure, the user searches for positive and negative arguments [...]</description>
<narrative>Highly relevant arguments make a clear statement about tenure for teachers in schools or universities [...]</narrative>
</topic>
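
For convenience, here is a minimal sketch for reading the topics file with Python's standard XML library. It assumes the downloaded file wraps the <topic> elements in a single root element; adjust to the actual file layout.

import xml.etree.ElementTree as ET

def load_topics(path):
    """Parse the topics file into a list of dicts (number, title, description, narrative)."""
    root = ET.parse(path).getroot()
    topics = []
    for topic in root.iter("topic"):
        topics.append({
            "number": topic.findtext("number", default="").strip(),
            "title": topic.findtext("title", default="").strip(),
            "description": topic.findtext("description", default="").strip(),
            "narrative": topic.findtext("narrative", default="").strip(),
        })
    return topics

# Example usage (file name is a placeholder):
# for t in load_topics("topics.xml"):
#     print(t["number"], t["title"])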

The corpus for Task 1 is ClueWeb22 category B. You may index ClueWeb22 with your favorite retrieval system. To ease participation, you may also directly use the ChatNoir search engine's Python API or its PyTerrier wrapper for a baseline retrieval. You will receive credentials to access the ChatNoir API upon completing your registration in TIRA. Please obtain a ClueWeb22 license from CMU (the $0 license is sufficient for working with ChatNoir).
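
For orientation, below is a minimal sketch of a baseline retrieval against the ChatNoir REST API using the requests library. The index identifier for ClueWeb22 category B ("cw22") and the result field names (trec_id, score) are assumptions based on the ChatNoir API documentation for earlier ClueWeb versions; consult the documentation and your credentials for the exact values.

import requests

API_ENDPOINT = "https://www.chatnoir.eu/api/v1/_search"
API_KEY = "<your ChatNoir API key>"  # received after TIRA registration

def chatnoir_search(query, size=100):
    """Retrieve candidate documents for one topic title."""
    response = requests.post(API_ENDPOINT, json={
        "apikey": API_KEY,
        "query": query,
        "index": ["cw22"],  # assumed identifier for ClueWeb22 category B
        "size": size,
    })
    response.raise_for_status()
    return response.json()["results"]

# Example usage:
# for hit in chatnoir_search("Should teachers get tenure?", size=10):
#     print(hit.get("trec_id"), hit.get("score"))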


Evaluation

Our human assessors will label the ranked results both for their general topical relevance and for their rhetorical argument quality [paper], i.e., "well-writtenness": (1) whether the document contains arguments and whether the argument text has a good style of speech, (2) whether the text has a proper sentence structure and is easy to follow, (3) whether it avoids profanity, typos, etc. Optionally, participants can detect the documents' stance: pro, con, neutral, or no stance towards the search topic. We will use nDCG@10 to evaluate the rankings and macro-averaged F1 to evaluate stance detection.
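
To illustrate the two measures (this is not the official evaluation script), here is a sketch using scikit-learn; the judgment and prediction values are made-up placeholders.

import numpy as np
from sklearn.metrics import f1_score, ndcg_score

# Ranking quality for one topic: graded judgments vs. retrieval scores.
y_true = np.asarray([[3, 2, 0, 1, 0, 2, 0, 0, 1, 0]])                       # assessor labels (placeholders)
y_score = np.asarray([[9.1, 8.7, 8.2, 7.9, 7.5, 7.1, 6.8, 6.3, 6.0, 5.5]])  # system scores (placeholders)
print("nDCG@10:", ndcg_score(y_true, y_score, k=10))

# Stance detection: macro-averaged F1 over the classes PRO, CON, NEU, NO.
gold = ["PRO", "CON", "NEU", "NO", "PRO", "CON"]
pred = ["PRO", "CON", "NO", "NO", "CON", "CON"]
print("Macro F1:", f1_score(gold, pred, average="macro"))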

Submission

We ask participants to use TIRA for result submissions.

Runs may be either automatic or manual. An automatic run must not modify the topic titles through any manual intervention; a manual run is any run that is not automatic. Upon submission, please let us know which of your runs are manual. For each topic, include up to 1,000 retrieved documents. Each team can submit up to 5 different runs.

The submission format for the task will follow the standard TREC format:

qid stance doc rank score tag

With:

  • qid: The topic number.
  • stance: The stance of the document (PRO: supports the topic, CON: against the topic, NEU: neutral stance, NO: no stance).
  • doc: The document ID.
  • rank: The rank at which the document is retrieved.
  • score: The score (integer or floating point) that produced the ranking. Scores must be in descending (non-increasing) order; take care to handle tied scores.
  • tag: A tag that identifies your group and the method you used to produce the run.

If you do not classify the stance, use Q0 as the value in the stance column. The fields should be separated by whitespace. The individual columns' widths are not restricted (e.g., the score can be given with arbitrary precision so that no ties occur), but it is important to include all columns and to separate them with whitespace.

An example run for Task 1 is:

1 PRO clueweb22-en0004-03-29836 1 17.89 myGroupMyMethod
1 CON clueweb22-en0010-05-00457 2 16.43 myGroupMyMethod
1 NEU clueweb22-en0000-00-00002 3 16.32 myGroupMyMethod
1 NO clueweb22-en0070-00-00123 4 15.22 myGroupMyMethod
...
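
As a sanity check before submitting, a minimal sketch that writes results into this format is given below; the results structure, file name, and tag are hypothetical placeholders.

def write_run(results, path="run.txt", tag="myGroupMyMethod"):
    """results: dict mapping topic number -> list of (doc_id, stance, score) tuples,
    where stance is PRO, CON, NEU, NO, or Q0 if the stance is not classified."""
    with open(path, "w") as run_file:
        for qid in sorted(results):
            # Sort by descending score so that ranks and scores are consistent.
            ranked = sorted(results[qid], key=lambda hit: hit[2], reverse=True)
            for rank, (doc_id, stance, score) in enumerate(ranked[:1000], start=1):
                run_file.write(f"{qid} {stance} {doc_id} {rank} {score} {tag}\n")

# Example usage:
# write_run({1: [("clueweb22-en0004-03-29836", "PRO", 17.89)]})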

TIRA Tutorial and TIRA Baselines

We provide a TIRA tutorial with baselines that can be executed in TIRA: https://github.com/touche-webis-de/touche-code/tree/main/clef23.

Results

Relevance results of all submitted runs:

Team           Tag                                     Mean nDCG@10  CI95 low  CI95 high
Puss in Boots  puss-in-boots_baseline                  0.834         0.791     0.875
Renji Abarai   renji_abarai_stance_ChatGPT             0.747         0.687     0.812
Renji Abarai   renji_abarai_stance-certainNO_ChatGPT   0.746         0.678     0.810
Renji Abarai   renji_abarai_ChatGPT_mmGhl              0.718         0.653     0.775
Renji Abarai   renji_abarai_ChatGPT_mmEQhl             0.718         0.650     0.779
Renji Abarai   renji_abarai_meta_qual_score            0.712         0.641     0.782
Renji Abarai   renji_abarai_baseline                   0.708         0.632     0.775
Renji Abarai   renji_abarai_meta_qual_prob             0.697         0.622     0.765

Quality results of all submitted runs:

Team           Tag                                     Mean nDCG@10  CI95 low  CI95 high
Puss in Boots  puss-in-boots_baseline                  0.831         0.786     0.873
Renji Abarai   renji_abarai_stance_ChatGPT             0.815         0.764     0.862
Renji Abarai   renji_abarai_stance-certainNO_ChatGPT   0.811         0.754     0.863
Renji Abarai   renji_abarai_ChatGPT_mmEQhl             0.789         0.730     0.846
Renji Abarai   renji_abarai_ChatGPT_mmGhl              0.789         0.731     0.842
Renji Abarai   renji_abarai_meta_qual_prob             0.774         0.712     0.830
Renji Abarai   renji_abarai_meta_qual_score            0.771         0.710     0.832
Renji Abarai   renji_abarai_baseline                   0.766         0.698     0.823

Stance results of all submitted runs. Since Renji Abarai re-ranked the same set of documents in all of their runs, the stance detection results are identical across these runs:

Team           Tag                                     F1 macro (run)  N (run)  F1 macro (team)  N (team)
Renji Abarai   renji_abarai_ChatGPT_mmEQhl             0.599           500      0.599            500
Renji Abarai   renji_abarai_ChatGPT_mmGhl              0.599           500      0.599            500
Renji Abarai   renji_abarai_baseline                   0.599           500      0.599            500
Renji Abarai   renji_abarai_meta_qual_prob             0.599           500      0.599            500
Renji Abarai   renji_abarai_meta_qual_score            0.599           500      0.599            500
Renji Abarai   renji_abarai_stance-certainNO_ChatGPT   0.599           500      0.599            500
Renji Abarai   renji_abarai_stance_ChatGPT             0.599           500      0.599            500
Puss in Boots  puss-in-boots_baseline                  0.203           500      0.203            500

Task Committee