PolicyQA: A Reading Comprehension Dataset for Privacy Policies
Published in Findings of the ACL: EMNLP, 2020
Figure: A pair of passage-question-answer triples from the PolicyQA dataset.
Security and privacy policy documents are long and verbose. A question answering (QA) system can help users find the information that is relevant and important to them. Prior studies in this domain frame the QA task as retrieving the most relevant text segment or a list of sentences from the policy document given a question. In contrast, we argue that providing users with a short text span extracted from the policy document helps them more, because it reduces the burden of searching for the target information in a verbose text segment. In this paper, we present PolicyQA, a dataset that contains 25,017 reading-comprehension-style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices. We present two strong neural baselines and rigorous analysis to reveal the advantages and challenges offered by PolicyQA.
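To make the span-extraction framing concrete, the following is a minimal sketch of how one passage-question-answer triple could be represented and checked. The `SpanExample` class, its field names, and the `make_example` helper are assumptions made for illustration, not the dataset's actual file format, and the passage and question are invented rather than taken from PolicyQA.

```python
from dataclasses import dataclass


@dataclass
class SpanExample:
    """One passage-question-answer triple (hypothetical schema)."""
    passage: str
    question: str
    answer_start: int  # character offset of the answer within the passage
    answer_text: str

    def is_consistent(self) -> bool:
        # The answer must be recoverable by slicing the passage at the offset.
        end = self.answer_start + len(self.answer_text)
        return self.passage[self.answer_start:end] == self.answer_text


def make_example(passage: str, question: str, answer_text: str) -> SpanExample:
    """Build an example, requiring the answer to be a verbatim span."""
    start = passage.find(answer_text)  # first occurrence only
    if start < 0:
        raise ValueError("answer must be a verbatim span of the passage")
    return SpanExample(passage, question, start, answer_text)


# Invented passage and question, for illustration only.
passage = (
    "We collect your email address when you create an account. "
    "We do not sell it to third parties."
)
example = make_example(
    passage,
    "What information do you collect?",
    "your email address",
)
```

Under this framing the system returns only the short span rather than the whole segment, and the offset check guards against answers that are paraphrased instead of extracted.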