Workshop at EMNLP 2019
Machine Reading for Question Answering (MRQA) has become an important testbed for evaluating how well computer systems understand human language, as well as a crucial technology for industry applications such as search engines and dialogue systems. Successful MRQA systems must deal with a wide range of natural language phenomena, such as lexical semantics, paraphrasing/entailment, coreference, and pragmatic reasoning, to answer questions based on text.
Recognizing the potential of MRQA as a comprehensive language understanding benchmark, the research community has recently created a multitude of large-scale datasets over text sources such as Wikipedia, news articles, fictional stories, and general web sources. These new datasets have in turn inspired an even wider array of new question answering systems.
Despite this rapid progress, there is still much to understand about these datasets and systems. While in-domain model accuracy is rapidly improving on these datasets, generalization suffers when models are evaluated on new domains and datasets. Focusing only on accuracy also obscures other important desiderata, including model interpretability, scalability, and robustness to perturbations. Similarly, the diversity of recent datasets calls for an analysis of the various natural language phenomena (coreference, paraphrasing, entailment, multi-hop reasoning) that these datasets present.
Towards this goal, this year we will seek submissions in two tracks: a research track and a new shared task track. Our shared task is specifically designed to test the generalization abilities of MRQA systems (see more details below).
This track is broad in scope and seeks submissions in areas including, but not limited to:
This track will seek submissions for MRQA models which can generalize to example distributions different from the distribution in the training dataset. A truly effective question answering system should do more than merely interpolate from the training set to answer test examples drawn from the same distribution: it should also be able to extrapolate to out-of-distribution examples — a significantly harder challenge. The task itself will focus on extractive question answering. Given a question and context passage, models must provide an answer as a span in the original document or “No Answer”, meaning the question cannot be answered with the context. Our evaluation dataset will differ in many ways from the training data, including:
We plan to design ten evaluation datasets, which will include existing MRQA datasets, QA datasets modified to fit the shared task paradigm, and reductions of related NLP tasks, such as relation extraction, to question answering. Systems will be ranked by their performance over all datasets. Participants will be given small development datasets for up to five of the test tasks, so that they can evaluate their models. However, no development data will be released for the other test tasks, to prevent participants from devising specialized techniques for the particular set of test tasks we have chosen.
The shared task will have two settings: one in which models can only be trained on provided training data (“Closed”), and one in which any training data may be used (“Open”). For the Closed setting, we will provide a training dataset, consisting of examples pooled from several public datasets. Participants are expected to only use this data for training models. For both settings, participants are allowed and encouraged to use off-the-shelf tools for linguistic annotations (e.g. POS taggers, syntactic parsers), as well as any publicly available unlabeled data and models derived from these (e.g. word vectors, pre-trained language models).
Participants in the shared task will be expected to submit a pre-trained model, which we will run on our hidden test data, as well as a paper describing their system. With this shared task, we hope to guide efforts towards learning robust models and architectures.