Workshop at EMNLP-IJCNLP 2019
November 4th, 2019
Room MR 201B-C
Machine Reading for Question Answering (MRQA) has become an important testbed for evaluating how well computer systems understand human language, as well as a crucial technology for industry applications such as search engines and dialogue systems. In a typical MRQA setting, a system must answer a question by reading one or more context documents. Successful MRQA systems must understand a wide range of natural language phenomena and a wide variety of question and document types. While recent progress on benchmark datasets has been impressive, models are still primarily evaluated on in-domain accuracy. It remains challenging to build MRQA systems that generalize to new test distributions (Chen et al., 2017; Levy et al., 2017; Yogatama et al., 2019) and are robust to test-time perturbations (Jia and Liang, 2017; Ribeiro et al., 2018).
To promote research on MRQA, particularly related to generalization, we seek submissions in two tracks: a research track and a new shared task track. Our shared task is specifically designed to test how well MRQA systems can generalize to new domains (see more details below).
10:10–10:30 | Best paper talk I: Multi-step Entity-centric Information Retrieval for Multi-Hop Question Answering
10:30–11:00 | Morning coffee break
11:35–12:10 | Shared task overview and results
12:10–12:30 | Shared task best system talk: D-NET: A Pre-Training and Fine-Tuning Framework for Improving the Generalization of Machine Reading Comprehension
12:30–14:00 | Lunch
14:00–14:20 | Best paper talk II: Evaluating Question Answering Evaluation
14:20–14:55 | Invited talk by Mohit Bansal (University of North Carolina at Chapel Hill): Interpretability and Robustness for Multi-Hop QA
14:55–16:30 | Poster session and afternoon coffee break
16:30–17:30 | Panel discussion
This year, we are introducing a new MRQA Shared Task, which tests whether existing MRQA systems can generalize beyond the datasets on which they were trained. A truly effective question answering system should do more than merely interpolate from the training set to answer test examples drawn from the same distribution: it should also be able to extrapolate to test examples drawn from different distributions.
Participants in the shared task will submit MRQA systems trained on a specified training dataset pooled from six existing large-scale datasets. Systems will be evaluated on their generalization to ten different test datasets. The test datasets will be in the same format as the training data, but may have different sources of document context (e.g., biology research papers) and questions (e.g., written by domain experts). We will release development sets for five of the test datasets, while keeping the other five test datasets hidden. This gives teams a way to measure progress during development, while discouraging them from designing specialized solutions for the particular test datasets we have chosen.
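For the official data format and evaluation script, see the Shared Task page. As an illustration only, systems in this setting are typically scored with the standard SQuAD-style metrics: exact match and token-overlap F1, each taken as the maximum over the provided gold answers. A minimal sketch of those metrics (the answer-normalization steps shown here follow the common SQuAD convention and are an assumption, not the official script):

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_toks = normalize_answer(prediction).split()
    gold_toks = normalize_answer(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    num_same = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def metric_max_over_answers(metric, prediction, gold_answers):
    """Score a prediction against its best-matching gold answer."""
    return max(metric(prediction, g) for g in gold_answers)
```

For example, `metric_max_over_answers(exact_match, "The Eiffel Tower", ["Eiffel Tower"])` yields 1.0, since normalization drops the leading article. Taking the maximum over gold answers matters because many of the pooled datasets provide several valid reference answers per question.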
For more information, please see the Shared Task page.
Despite the rapid progress in MRQA, there is still much to understand about MRQA datasets and systems. While in-domain model accuracy is rapidly improving on these datasets, generalization suffers when models are evaluated on new domains and datasets. Focusing only on accuracy also obscures other important desiderata, including model interpretability, scalability, and robustness to perturbations. Similarly, the diversity of recent datasets calls for an analysis of the various natural language phenomena (coreference, paraphrasing, entailment, multi-hop reasoning) that these datasets present.
This track is broad in scope and welcomes submissions on a wide range of topics; for the full list of topic areas and submission details, please see the Call for Papers.
This year, we presented three paper awards; the award-winning papers are featured in the program above.
Please see Invitation to Sponsor MRQA for sponsorship details.