Evidence Retrieval for Causal Questions 2023

Synopsis

Task: Given a causality-related topic, the task is to retrieve and rank documents by relevance to the topic and to detect each document's causal stance.
Input: [topics]
Submission: [submit].

Task

The goal of Task 2 is to support users who want to understand whether a causal relationship exists between two events or actions. Given a causality-related topic and a collection of web documents, the task is to retrieve and rank documents by relevance to the topic and to detect each document's causal stance (i.e., whether the document supports, refutes, or provides no information about the causal statement in the topic's title).

Data

Example topic for Task 2 (download topics):

<topic>
<number>1</number>
<title>Can eating broccoli lead to constipation?</title>
<cause>broccoli</cause>
<effect>constipation</effect>
<description>A young parent has a child experiencing constipation after eating some broccoli for dinner and is wondering [...]</description>
<narrative>Relevant documents will discuss if broccoli and other high fiber foods can cause or ease constipation [...]</narrative>
</topic>
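
As an illustration of how the topic fields might be read, the following minimal sketch parses a topic with Python's standard library. It assumes the topics file wraps the <topic> elements in a single root element (here a hypothetical <topics> tag); the actual layout of the downloaded file may differ. The function name parse_topics is ours, not part of any task tooling.

```python
import xml.etree.ElementTree as ET

# Sample topic from above, wrapped in a hypothetical <topics> root element.
SAMPLE = """<topics>
<topic>
<number>1</number>
<title>Can eating broccoli lead to constipation?</title>
<cause>broccoli</cause>
<effect>constipation</effect>
<description>A young parent has a child experiencing constipation after eating some broccoli for dinner and is wondering [...]</description>
<narrative>Relevant documents will discuss if broccoli and other high fiber foods can cause or ease constipation [...]</narrative>
</topic>
</topics>"""


def parse_topics(xml_text):
    """Return one dict per <topic>, mapping each child tag to its stripped text."""
    root = ET.fromstring(xml_text)
    topics = []
    # iter() also matches the root itself, so a bare <topic> document works too.
    for node in root.iter("topic"):
        topics.append({child.tag: (child.text or "").strip() for child in node})
    return topics


topics = parse_topics(SAMPLE)
print(topics[0]["number"], topics[0]["title"])
```

An automatic run would then use only the "title" value, while a semi-automatic run could also read "cause" and "effect".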

The corpus for Task 2 is ClueWeb22 category B. You may index ClueWeb22 with your favorite retrieval system. To ease participation, you may also directly use the ChatNoir search engine's Python API or PyTerrier wrapper for baseline retrieval. You will receive credentials to access the ChatNoir API once you have completed your registration in TIRA. Please obtain a license from CMU. (The $0 license is sufficient to work with ChatNoir.)

Evaluation

Our human assessors will label the retrieved documents according to two relevance dimensions: (1) whether the document is on topic, i.e., contains information about the causal relationship of the events in a question; the direction of causality will be considered, e.g., a document stating that B causes A will be considered off-topic for the question "Does A cause B?"; and (2) if the document is on topic, whether the contained evidence is circumstantial (e.g., a single observation of the co-occurrence of two events) or general (e.g., a statement gained through inductive reasoning). Stance detection is optional: a document can provide supportive evidence (a causal relationship between the cause and effect from the topic holds), refutative evidence (a causal relationship does not hold), neutral evidence (the relationship holds in some cases and not in others), or no evidence at all. We will use nDCG@5 to evaluate rankings and accuracy to evaluate stance detection.
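
For local sanity checks, the two measures can be sketched as follows. This is a minimal illustration, not the official evaluation code: it assumes graded relevance labels given as numbers (for example 0, 1, 2, which is a hypothetical scale), a linear gain, and a log2 rank discount, which is one common nDCG variant. The exact variant and label scale used by the organizers are not stated here.

```python
import math


def dcg_at_k(gains, k=5):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, all_gains, k=5):
    """nDCG@k: DCG of the submitted ranking divided by DCG of the ideal ranking.

    ranked_gains: labels of the retrieved documents, in rank order.
    all_gains: labels of all judged documents for the topic (used for the ideal ranking).
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def accuracy(predicted, gold):
    """Fraction of documents whose predicted stance equals the gold stance."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


print(ndcg_at_k([0, 1, 2], [2, 1, 0]))
print(accuracy(["SUP", "REF"], ["SUP", "NO"]))
```

A ranking that places the most relevant documents first scores 1.0; any other ordering of the same documents scores lower.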

Submission

We ask participants to use TIRA for result submissions.

Runs may be automatic, semi-automatic, or manual. An automatic run must use only the topic titles and must not "manipulate" them via manual intervention. Semi-automatic runs may additionally use the <cause> and <effect> fields. A manual run is anything that is neither automatic nor semi-automatic. Upon submission, please let us know the type of each of your runs. For each topic, include up to 1,000 retrieved documents. Each team can submit up to 5 different runs.

The submission format for the task will follow the standard TREC format:

qid stance doc rank score tag

With:

qid: The topic number.
stance: The causal stance of the document (SUP: document provides evidence for the causal relation, REF: document provides evidence against the causal relation, NEU: neutral stance, i.e., document provides inconclusive evidence or both supporting and refuting evidence, NO: document provides no evidence).
doc: The document ID.
rank: The rank the document is retrieved at.
score: The score (integer or floating point) that generated the ranking. The score must be in descending (non-increasing) order. It is important to handle tied scores.
tag: A tag that identifies your group and the method you used to produce the run.

If you do not classify the stance, use Q0 as the value in the stance column. The fields must be separated by whitespace, and every line must include all six columns. Column widths are not restricted (e.g., the score can have arbitrary precision so that no ties occur).
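
Before uploading, a run can be checked against the rules above with a few lines of code. The following is a minimal sketch of such a check, not an official validator; the function name check_run and the error messages are our own, and it assumes that the lines of each topic appear in rank order.

```python
STANCES = {"SUP", "REF", "NEU", "NO", "Q0"}


def check_run(lines):
    """Return a list of human-readable problems found in the run lines."""
    errors = []
    last_score = {}  # qid -> score of the previous line for that topic
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            errors.append(f"line {n}: expected 6 fields, got {len(fields)}")
            continue
        qid, stance, doc, rank, score, tag = fields
        if stance not in STANCES:
            errors.append(f"line {n}: unknown stance {stance!r}")
        if not rank.isdigit():
            errors.append(f"line {n}: rank {rank!r} is not a positive integer")
        try:
            value = float(score)
        except ValueError:
            errors.append(f"line {n}: score {score!r} is not a number")
            continue
        # Scores must be non-increasing within a topic.
        if qid in last_score and value > last_score[qid]:
            errors.append(f"line {n}: score increases within topic {qid}")
        last_score[qid] = value
    return errors


print(check_run(["1 SUP docA 1 2.0 myTag", "1 Q0 docB 2 1.5 myTag"]))
```

An empty list means that no problem was found by these particular checks; it does not guarantee that the run will be accepted.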

An example run for Task 2 is:

1 SUP clueweb22-en0004-03-29836 1 17.89 myGroupMyMethod
1 REF clueweb22-en0010-05-00457 2 16.43 myGroupMyMethod
1 NEU clueweb22-en0000-00-00002 3 16.32 myGroupMyMethod
1 NO clueweb22-en0070-00-00123 4 15.22 myGroupMyMethod
...
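
Lines like those in the example can be produced from a list of scored documents. The sketch below is one possible way to do so, under our own assumptions: results arrive as (document ID, score, stance) tuples, and tied scores are broken deterministically by document ID. The function name format_run is hypothetical and not part of any task tooling.

```python
VALID_STANCES = {"SUP", "REF", "NEU", "NO", "Q0"}


def format_run(qid, results, tag, max_docs=1000):
    """Turn (doc_id, score, stance) tuples into run lines for one topic."""
    # Sort by descending score; break ties by document ID so the order is stable.
    ordered = sorted(results, key=lambda r: (-r[1], r[0]))[:max_docs]
    lines = []
    for rank, (doc_id, score, stance) in enumerate(ordered, start=1):
        if stance not in VALID_STANCES:
            raise ValueError(f"unknown stance: {stance!r}")
        lines.append(f"{qid} {stance} {doc_id} {rank} {score} {tag}")
    return lines


results = [
    ("clueweb22-en0010-05-00457", 16.43, "REF"),
    ("clueweb22-en0004-03-29836", 17.89, "SUP"),
]
for line in format_run(1, results, "myGroupMyMethod"):
    print(line)
```

Truncating to max_docs after sorting keeps the highest-scoring documents when more than 1,000 are available for a topic.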

TIRA Tutorial and TIRA Baselines

We provide a TIRA tutorial, including baselines that can be executed in TIRA, at https://github.com/touche-webis-de/touche-code/tree/main/clef23.

Results

Relevance results of all submitted runs.
Team	Tag	Mean nDCG@5	CI95 Low	CI95 High
He Man	heman_no_expansion_rerank	0.657	0.564	0.740
Puss In Boots	puss-in-boots_baseline	0.585	0.503	0.673
He Man	heman_gpt_expansion_rerank	0.374	0.284	0.469
He Man	heman_causenet_expansion_rerank	0.268	0.172	0.368