Workshop at EMNLP-IJCNLP 2019
The 2019 MRQA Shared Task focuses on generalization. We release an official training dataset containing examples from existing QA datasets, and evaluate submitted models on ten hidden QA test datasets. Train and test datasets may differ in some of the following ways:
Both train and test datasets have the same format, and this year we focus on extractive question answering. That is, given a question and a context passage, systems must find the segment of text, or span, in the document that best answers the question. While this format is somewhat restrictive, it allows us to leverage many existing datasets, and its simplicity helps us focus on out-of-domain generalization instead of other important but orthogonal challenges.
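To make the extractive setting concrete, here is a minimal sketch of an answer represented as a character span in the context passage. The `Span` type and the helper names are hypothetical and exist only for illustration; the official data format is the one documented in the GitHub repository.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Character offsets into the context; `end` is exclusive (illustrative convention)."""
    start: int
    end: int


def extract(context: str, span: Span) -> str:
    """Return the text that a span selects from the context passage."""
    return context[span.start:span.end]


def find_spans(context: str, answer: str) -> List[Span]:
    """Locate every exact (case-sensitive) occurrence of an answer string in the context."""
    spans: List[Span] = []
    if not answer:
        return spans
    start = context.find(answer)
    while start != -1:
        spans.append(Span(start, start + len(answer)))
        start = context.find(answer, start + 1)
    return spans


context = "Marie Curie won the Nobel Prize in Physics in 1903."
question = "When did Marie Curie win the Nobel Prize in Physics?"
spans = find_spans(context, "1903")
print([extract(context, s) for s in spans])
```

An extractive system never generates free text; its output is always recoverable as `extract(context, span)` for some span in the passage.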
Each participant will submit a single QA system trained on the provided training data. We will then privately evaluate each system on the hidden test data.
All participants are required to use our official training corpus (see our GitHub repository for details), which consists of examples pooled from the following datasets:
No other question answering data may be used for training. We allow and encourage participants to use off-the-shelf tools for linguistic annotation (e.g. POS taggers, syntactic parsers), as well as any publicly available unlabeled data and models derived from these (e.g. word vectors, pre-trained language models).
For development, we release development datasets for five of the ten test datasets (out-of-domain):
In addition, we provide “in-domain” dev datasets to help develop models. The final evaluation, however, will contain only out-of-domain data.
We will keep the other five test datasets hidden until the conclusion of the shared task. We hope this will prevent teams from building solutions that are specific to our test datasets, but do not generalize to other datasets.
Note: while the development data can be used for model selection, participants should not train models directly on the development data.
Systems will first be evaluated using automatic metrics: exact match score (EM) and word-level F1-score (F1). EM only gives credit for predictions that exactly match the gold answer(s), whereas F1 gives partial credit for partial word overlap with the gold answer(s). We will judge systems primarily on their (macro-) average F1 score across all test datasets.
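The two metrics and the macro-average can be sketched as follows. This is an illustration only: the answer normalization (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption here, and all function names are hypothetical. The official evaluation script in the repository is authoritative.

```python
import re
import string
from collections import Counter
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1(prediction: str, gold: str) -> float:
    """Word-level F1: harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def dataset_score(metric: Callable[[str, str], float],
                  predictions: Dict[str, str],
                  gold_answers: Dict[str, List[str]]) -> float:
    """Average, over questions, of the best score against any gold answer (0-100 scale)."""
    scores = [
        max(metric(predictions.get(qid, ""), gold) for gold in golds)
        for qid, golds in gold_answers.items()
    ]
    return 100.0 * sum(scores) / len(scores)


def macro_average(per_dataset: Dict[str, float]) -> float:
    """Unweighted mean over datasets, so each dataset counts equally regardless of size."""
    return sum(per_dataset.values()) / len(per_dataset)
```

For example, the prediction "the Eiffel Tower in Paris" against the gold answer "Eiffel Tower" gets an EM of 0 but an F1 of about 0.67 (precision 2/4, recall 2/2). Macro-averaging means a small test dataset influences the final ranking as much as a large one.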
Time and resources permitting, we plan to run human evaluation on the top few systems with the highest overall score. Human evaluators will directly judge whether top systems’ predictions are good answers to the test questions.
After models have been submitted, we will release anonymized, interactive web demos for high-performing models. Anyone will be able to pose their own questions to these models, in order to better understand their strengths and weaknesses. We will report on these findings at the workshop.
We detail the data format and submission instructions, along with our baseline models, in this GitHub repository. For any inquiries about the shared task or the submission process, please open a new issue in the repository.
Please register your team through this form.
All submission deadlines are 11:59 PM GMT-12 (anywhere in the world) unless otherwise noted.
For any questions regarding our shared task, please use GitHub issues. We are here to answer your questions and look forward to your submissions!