What is HoVer?
HoVer is an open-domain, many-hop fact extraction and claim verification dataset built upon the Wikipedia corpus. The original 2-hop claims are adapted from question-answer pairs in HotpotQA. It was collected by a team of NLP researchers at UNC Chapel Hill and Verisk Analytics.
For more details about HoVer, please refer to our Findings of EMNLP 2020 paper, "HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification".
Getting started
HoVer is distributed under a CC BY-SA 4.0 License. The training, development, and test sets can be downloaded below; the test set contains only the claims.
To evaluate your model on the test set, submit your model predictions to hover.nlp.dataset@gmail.com. Please refer to the sample prediction format below.
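As a rough illustration, a submission might be written like the sketch below. This is not the authoritative schema (that is defined by the sample prediction file referenced above); the field names "uid", "predicted_label", and "predicted_evidence" are assumptions chosen to mirror the gold annotations.

```python
# Hypothetical prediction format: field names are assumptions mirroring the
# dataset's gold fields; defer to the official sample prediction file.
import json

predictions = [
    {
        "uid": "example-uid-0000",       # claim identifier from the test file
        "predicted_label": "SUPPORTED",  # or "NOT_SUPPORTED"
        # Ranked supporting facts as [article title, sentence index] pairs.
        "predicted_evidence": [["Some Article", 0], ["Another Article", 2]],
    },
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```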
We use the Wikipedia data processed by the HotpotQA team to construct the dataset. Please use it as the corpus for document retrieval, as other versions of Wikipedia may differ in content.
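If you are indexing that corpus yourself, a minimal loading sketch might look like the following. It assumes the processed dump's layout as we understand it (directories of bz2-compressed JSON-lines files, with a "title" field and a "text" field that holds a list of paragraphs, each a list of sentences); adjust the field handling if your copy differs.

```python
# Hedged sketch for iterating over the HotpotQA-processed Wikipedia dump.
# Assumes bz2-compressed JSON-lines files with "title" and "text" fields.
import bz2
import json
from pathlib import Path

def iter_wiki_articles(dump_dir):
    """Yield (title, sentences) pairs from every *.bz2 file under dump_dir."""
    for path in sorted(Path(dump_dir).rglob("*.bz2")):
        with bz2.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                article = json.loads(line)
                # "text" is assumed to be a list of paragraphs, each a list
                # of sentences; flatten to one sentence list per article.
                sentences = [s for para in article["text"] for s in para]
                yield article["title"], sentences
```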
Evaluation
As explained in Section 5.5 of the paper, the HoVer score is the percentage of examples for which the model retrieves at least one supporting fact from every supporting document and predicts the correct label. However, this does not mean that submitting a large number of supporting facts can boost the score: when calculating the HoVer score for a k-hop example, we only consider supporting facts (sentences) from the first k+1 documents retrieved by your model, and for each document considered, we only evaluate the top-2 selected sentences. So remember to rank your retrieved supporting facts! Please refer to our evaluation script, provided below, for calculating the supporting-fact F1 score and the HoVer score (coming soon).
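Until the official script is released, the scoring rules above can be sketched as follows. The gold field names ("uid", "label", "supporting_facts", "num_hops") follow the released data files; the prediction fields ("predicted_label", "predicted_evidence") are assumptions, not the official format.

```python
# Minimal sketch of the HoVer score as described above; not the official
# evaluation script.

def truncate_evidence(predicted_evidence, num_hops):
    """Keep ranked facts from the first (num_hops + 1) distinct documents,
    and at most the top-2 sentences per document."""
    kept, doc_order, per_doc = [], [], {}
    for title, sent_id in predicted_evidence:
        if title not in per_doc:
            if len(doc_order) >= num_hops + 1:
                continue  # beyond the first k+1 retrieved documents
            doc_order.append(title)
            per_doc[title] = 0
        if per_doc[title] < 2:  # only the top-2 sentences per document
            kept.append((title, sent_id))
            per_doc[title] += 1
    return kept

def hover_score(gold_examples, predictions):
    """predictions: dict mapping uid -> {"predicted_label": ...,
    "predicted_evidence": [[title, sent_id], ...]} in ranked order."""
    correct = 0
    for ex in gold_examples:
        pred = predictions[ex["uid"]]
        kept = set(truncate_evidence(
            [tuple(e) for e in pred["predicted_evidence"]], ex["num_hops"]))
        gold_docs = {title for title, _ in ex["supporting_facts"]}
        # Documents from which at least one gold supporting fact was retrieved.
        docs_hit = {t for t, s in ex["supporting_facts"] if (t, s) in kept}
        # Credit requires every gold document covered AND the correct label.
        if docs_hit == gold_docs and pred["predicted_label"] == ex["label"]:
            correct += 1
    return correct / len(gold_examples)
```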
Citation
If you use HoVer in your research, please cite our paper with the following BibTeX entry:
@inproceedings{jiang2020hover,
  title={{HoVer}: A Dataset for Many-Hop Fact Extraction And Claim Verification},
  author={Yichen Jiang and Shikha Bordia and Zheng Zhong and Charles Dognin and Maneesh Singh and Mohit Bansal},
  booktitle={Findings of the Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
  year={2020}
}
Leaderboard

| Rank | Date | Model | Code | Fact Extraction (EM) | Fact Extraction (F1) | HoVer Score (Acc) |
|---|---|---|---|---|---|---|
| 1 | May 24, 2021 | Baleen (Stanford University; Khattab et al., 2021) | | 39.78 | 80.41 | 57.53 |
| 2 | Oct 13, 2020 | Baseline Model, single model (UNC Chapel Hill & Verisk Analytics; Jiang, Bordia, et al., 2020) | | 4.5 | 49.5 | 15.32 |