What is HoVer?

HoVer is an open-domain, many-hop fact extraction and claim verification dataset built upon the Wikipedia corpus. The original 2-hop claims are adapted from question-answer pairs in HotpotQA. It was collected by a team of NLP researchers at UNC Chapel Hill and Verisk Analytics.

For more details about HoVer, please refer to our EMNLP 2020 paper:

Getting started

HoVer is distributed under a CC BY-SA 4.0 License. The training, development, and test sets can be downloaded below. The test set contains only the claims.

To evaluate your model on the test set, submit your model predictions to hover.nlp.dataset@gmail.com. Please refer to the sample prediction format below.

We use the Wikipedia data processed by the HotpotQA team to construct the dataset. Please use it as the corpus for document retrieval, as other versions of Wikipedia may differ in content.


As explained in Sec. 5.5 of the paper, the HoVer score is the percentage of examples for which the model retrieves at least one supporting fact from every supporting document and predicts the correct label. However, submitting a large number of supporting facts cannot inflate the score: when calculating the HoVer score for a k-hop example, we only consider supporting facts (sentences) from the first k+1 documents retrieved by your model, and for each document considered, we only evaluate the top-2 selected sentences. So remember to rank your retrieved supporting facts! Please refer to our evaluation script provided below for calculating the supporting-fact F1 score and HoVer score (Coming Soon).
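The truncation rule above can be sketched as follows. This is a minimal illustration, not the official evaluation script (which is still to be released); the function name, the `(title, sentence_id)` fact representation, and the label strings are assumptions.

```python
def hover_example_correct(pred_label, gold_label, pred_facts, gold_facts, k):
    """Sketch of the per-example HoVer-score check for a k-hop claim.

    pred_facts: model's supporting facts as a RANKED list of (title, sent_id).
    gold_facts: ground-truth supporting facts as a set of (title, sent_id).
    """
    kept_docs, kept = [], []
    for title, sent_id in pred_facts:
        if title not in kept_docs:
            # Only facts from the first k+1 distinct retrieved documents count.
            if len(kept_docs) == k + 1:
                continue
            kept_docs.append(title)
        # Within each considered document, only the top-2 ranked sentences count.
        if sum(1 for t, _ in kept if t == title) < 2:
            kept.append((title, sent_id))
    # The example scores only if every gold document contributes at least one
    # correctly retrieved fact AND the label is correct.
    gold_docs = {t for t, _ in gold_facts}
    covered = {t for t, s in kept if (t, s) in gold_facts}
    return pred_label == gold_label and covered == gold_docs
```

Note how ranking matters: if a model floods one document with low-quality sentences, only its top two are ever evaluated.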


If you use HoVer in your research, please cite our paper with the following BibTeX entry:

@inproceedings{jiang2020hover,
  title={{HoVer}: A Dataset for Many-Hop Fact Extraction And Claim Verification},
  author={Yichen Jiang and Shikha Bordia and Zheng Zhong and Charles Dognin and Maneesh Singh and Mohit Bansal},
  booktitle={Findings of the Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
  year={2020}
}
In order to solve a HoVer example, a system must first retrieve the supporting facts from the entire Wikipedia corpus and then predict whether the claim is supported or not. The retrieved facts are evaluated against the ground truth to yield the exact-match and F1 scores. The HoVer score represents the percentage of examples where at least one supporting fact from every supporting document is retrieved and the correct label is predicted.
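The exact-match and F1 scores mentioned above are the standard set-based retrieval metrics over sentence-level facts. Below is a minimal sketch under that assumption; the official script may differ in details such as averaging across examples.

```python
def fact_em_f1(pred_facts, gold_facts):
    """Per-example exact-match and F1 over retrieved (title, sent_id) facts."""
    pred, gold = set(pred_facts), set(gold_facts)
    em = float(pred == gold)          # 1.0 only if the sets match exactly
    tp = len(pred & gold)             # correctly retrieved facts
    if tp == 0:
        return em, 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return em, 2 * precision * recall / (precision + recall)
```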
Model                                                    Fact Extraction       HoVer Score
                                                         EM        F1          Acc
------------------------------------------------------------------------------------------
Baseline Model (single model)                            4.5       49.5        15.32
UNC Chapel Hill & Verisk Analytics
(Jiang, Bordia, et al. 2020)
Oct 13, 2020