Argument Retrieval for Controversial Questions 2023
Synopsis
- Task: Given a controversial topic, retrieve and rank documents by their relevance and argument quality, and detect each document's stance.
- Input: [topics].
- Submission: [submit].
Task
The goal of Task 1 is to provide an overview of arguments and opinions on controversial topics. Given a controversial topic and a collection of web documents, the task is to retrieve and rank documents by relevance to the topic and by argument quality, and to detect the document stance. Participants of Task 1 will retrieve and rank documents from the ClueWeb22 crawl that contain relevant, high-quality arguments for a given set of 50 search topics.
Data
Example topic for Task 1 (download topics):
<topic>
<number>1</number>
<title>Should teachers get tenure?</title>
<description>A user has heard that some countries do give teachers tenure and others don't.
Interested in the reasoning for or against tenure, the user searches for positive and negative arguments [...]</description>
<narrative>Highly relevant arguments make a clear statement about tenure for teachers in schools or universities [...]</narrative>
</topic>
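The topics file can be parsed with a few lines of Python's standard library. The sketch below assumes that the downloaded file wraps the individual <topic> elements in a single root element and is saved as topics.xml; both are assumptions about your local copy.

```python
# Minimal sketch for reading the Task 1 topics file with the standard library.
# Assumption: the downloaded file wraps the <topic> elements shown above in one root element.
import xml.etree.ElementTree as ET

def load_topics(path: str) -> list[dict]:
    """Return one dict per <topic> with its number, title, description, and narrative."""
    root = ET.parse(path).getroot()
    topics = []
    for topic in root.iter("topic"):
        topics.append({
            "number": topic.findtext("number", default="").strip(),
            "title": topic.findtext("title", default="").strip(),
            "description": topic.findtext("description", default="").strip(),
            "narrative": topic.findtext("narrative", default="").strip(),
        })
    return topics

if __name__ == "__main__":
    for t in load_topics("topics.xml"):
        print(t["number"], t["title"])
```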
The corpus for Task 1 is ClueWeb22 category B. You may index ClueWeb22 with your favorite retrieval system. To ease participation, you may also directly use the ChatNoir search engine's Python API or its PyTerrier wrapper for a baseline retrieval. You will receive credentials to access the ChatNoir API upon completed registration in TIRA. Please obtain a ClueWeb22 license from CMU; the $0 license is sufficient to work with ChatNoir.
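For a quick baseline, ChatNoir can also be queried over plain HTTP. The sketch below is not the official client: the endpoint path, the "cw22" index identifier, and the "trec_id" result field are assumptions to verify against the ChatNoir documentation, and the API key placeholder must be replaced with the credentials issued via TIRA.

```python
# Hedged sketch of a baseline retrieval via the ChatNoir REST API using requests.
# Assumptions to verify: endpoint path, "cw22" index name, and "trec_id" result field.
import requests

API_KEY = "YOUR_TIRA_API_KEY"  # issued after registration in TIRA
ENDPOINT = "https://www.chatnoir.eu/api/v1/_search"  # assumed search endpoint

def chatnoir_search(query: str, size: int = 100) -> list[dict]:
    """Query ChatNoir and return the raw result list (document ID, score, title, ...)."""
    response = requests.post(
        ENDPOINT,
        json={"apikey": API_KEY, "query": query, "index": ["cw22"], "size": size},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("results", [])

if __name__ == "__main__":
    for hit in chatnoir_search("Should teachers get tenure?", size=10):
        print(hit.get("trec_id"), hit.get("score"))
```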
Additional resources:
- Argument relevance and quality judgments: [Touché qrels].
- Argument mining tool: [TARGER].
Evaluation
Our human assessors will label the ranked results both for their general topical relevance and for the rhetorical argument quality [paper], i.e., "well-writtenness": (1) whether the document contains arguments and whether the argument text has a good style of speech, (2) whether the text has a proper sentence structure and is easy to follow, (3) whether it includes profanity, has typos, etc. Optionally, participants can detect each document's stance: pro, con, neutral, or no stance towards the search topic. We will use nDCG@10 to evaluate rankings and a macro-averaged F1 to evaluate stance detection.
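To get a feeling for the two measures, the toy sketch below computes nDCG@10 and macro-averaged F1 with scikit-learn on made-up data; it is not the official evaluation code, and the relevance grades and stance labels are purely illustrative.

```python
# Illustrative computation of the two evaluation measures on toy data (not the official code).
import numpy as np
from sklearn.metrics import f1_score, ndcg_score

# One query: assessor relevance grades and the run's retrieval scores for the same documents.
true_relevance = np.asarray([[3, 2, 0, 1, 0, 2, 0, 0, 1, 0]])
run_scores = np.asarray([[17.9, 16.4, 16.3, 15.2, 14.8, 14.1, 13.0, 12.5, 12.1, 11.7]])
print("nDCG@10:", ndcg_score(true_relevance, run_scores, k=10))

# Stance detection: macro-averaged F1 over the four classes PRO, CON, NEU, NO.
gold = ["PRO", "CON", "NEU", "NO", "PRO", "CON"]
pred = ["PRO", "CON", "PRO", "NO", "NEU", "CON"]
print("Macro F1:", f1_score(gold, pred, average="macro"))
```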
Submission
We ask participants to use TIRA for result submissions.
Runs may be either automatic or manual. An automatic run must be produced without any manual manipulation of the topic titles. A manual run is anything that is not an automatic run. Upon submission, please let us know which of your runs are manual. For each topic, include up to 1,000 retrieved documents. Each team can submit up to 5 different runs.
The submission format for the task will follow the standard TREC format:
qid stance doc rank score tag
With:
- qid: The topic number.
- stance: The stance of the document (PRO: supports the topic, CON: against the topic, NEU: neutral stance, NO: no stance).
- doc: The document ID.
- rank: The rank at which the document is retrieved.
- score: The score (integer or floating point) that generated the ranking. Scores must be in descending (non-increasing) order; it is important to handle tied scores.
- tag: A tag that identifies your group and the method you used to produce the run.
If you do not classify the stance, use Q0 as the value in the stance column. The fields must be separated by whitespace. The individual columns' widths are not restricted (e.g., the score may have arbitrary precision so that ties can be avoided), but all columns must be included.
An example run for Task 1 is:
1 PRO clueweb22-en0004-03-29836 1 17.89 myGroupMyMethod
1 CON clueweb22-en0010-05-00457 2 16.43 myGroupMyMethod
1 NEU clueweb22-en0000-00-00002 3 16.32 myGroupMyMethod
1 NO clueweb22-en0070-00-00123 4 15.22 myGroupMyMethod
...
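Such a run file can be written with a short script. The sketch below is illustrative: the results dictionary and the write_run helper are hypothetical names, assuming your pipeline yields (document ID, stance, score) triples per topic.

```python
# Sketch for writing a run file in the format above.
# Assumption: `results` maps each topic number to (document ID, stance, score) triples.
TAG = "myGroupMyMethod"  # replace with your group/method identifier

def write_run(results: dict[int, list[tuple[str, str, float]]], path: str, tag: str = TAG) -> None:
    """Write one whitespace-separated line per retrieved document, ranked by descending score."""
    with open(path, "w") as run_file:
        for qid in sorted(results):
            # Sort by descending score; ties are broken by input order, but ranks stay distinct.
            ranked = sorted(results[qid], key=lambda triple: triple[2], reverse=True)
            for rank, (doc_id, stance, score) in enumerate(ranked[:1000], start=1):
                run_file.write(f"{qid} {stance} {doc_id} {rank} {score} {tag}\n")

if __name__ == "__main__":
    example = {1: [("clueweb22-en0004-03-29836", "PRO", 17.89),
                   ("clueweb22-en0010-05-00457", "CON", 16.43)]}
    write_run(example, "run.txt")
```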
TIRA Tutorial and TIRA Baselines
We provide a TIRA tutorial with baselines that can be executed in TIRA: https://github.com/touche-webis-de/touche-code/tree/main/clef23.
Results
Relevance (nDCG@10):
Team | Tag | Mean nDCG@10 | CI95 Low | CI95 High
---|---|---|---|---
Puss in Boots | puss-in-boots_baseline | 0.834 | 0.791 | 0.875 |
Renji Abarai | renji_abarai_stance_ChatGPT | 0.747 | 0.687 | 0.812 |
Renji Abarai | renji_abarai_stance-certainNO_ChatGPT | 0.746 | 0.678 | 0.810 |
Renji Abarai | renji_abarai_ChatGPT_mmGhl | 0.718 | 0.653 | 0.775 |
Renji Abarai | renji_abarai_ChatGPT_mmEQhl | 0.718 | 0.650 | 0.779 |
Renji Abarai | renji_abarai_meta_qual_score | 0.712 | 0.641 | 0.782 |
Renji Abarai | renji_abarai_baseline | 0.708 | 0.632 | 0.775 |
Renji Abarai | renji_abarai_meta_qual_prob | 0.697 | 0.622 | 0.765 |
Quality (nDCG@10):
Team | Tag | Mean nDCG@10 | CI95 Low | CI95 High
---|---|---|---|---
Puss in Boots | puss-in-boots_baseline | 0.831 | 0.786 | 0.873 |
Renji Abarai | renji_abarai_stance_ChatGPT | 0.815 | 0.764 | 0.862 |
Renji Abarai | renji_abarai_stance-certainNO_ChatGPT | 0.811 | 0.754 | 0.863 |
Renji Abarai | renji_abarai_ChatGPT_mmEQhl | 0.789 | 0.730 | 0.846 |
Renji Abarai | renji_abarai_ChatGPT_mmGhl | 0.789 | 0.731 | 0.842 |
Renji Abarai | renji_abarai_meta_qual_prob | 0.774 | 0.712 | 0.830 |
Renji Abarai | renji_abarai_meta_qual_score | 0.771 | 0.710 | 0.832 |
Renji Abarai | renji_abarai_baseline | 0.766 | 0.698 | 0.823 |
Stance detection (macro-averaged F1):
Team | Tag | F1 macro (run) | N (run) | F1 macro (team) | N (team)
---|---|---|---|---|---
Renji Abarai | renji_abarai_ChatGPT_mmEQhl | 0.599 | 500 | 0.599 | 500 |
Renji Abarai | renji_abarai_ChatGPT_mmGhl | 0.599 | 500 | 0.599 | 500 |
Renji Abarai | renji_abarai_baseline | 0.599 | 500 | 0.599 | 500 |
Renji Abarai | renji_abarai_meta_qual_prob | 0.599 | 500 | 0.599 | 500 |
Renji Abarai | renji_abarai_meta_qual_score | 0.599 | 500 | 0.599 | 500 |
Renji Abarai | renji_abarai_stance-certainNO_ChatGPT | 0.599 | 500 | 0.599 | 500 |
Renji Abarai | renji_abarai_stance_ChatGPT | 0.599 | 500 | 0.599 | 500 |
Puss in Boots | puss-in-boots_baseline | 0.203 | 500 | 0.203 | 500 |