Privacy at scale: Introducing the PrivaSeer corpus of web privacy policies

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English-language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state-of-the-art results on the data practice classification and question answering tasks.
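As an illustration of the duplicate and near-duplicate removal stage named in the abstract, here is a minimal sketch in Python. The abstract does not say which technique the authors use, so everything here is an assumption: word-shingle Jaccard similarity is swapped in as one common approach, and the function names (`shingles`, `jaccard`, `drop_near_duplicates`), the shingle size `k`, and the `threshold` value are hypothetical.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts are represented by a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, k=3):
    """Keep the first document of each group of near-duplicates.

    Assumption: two documents are near-duplicates when the Jaccard
    similarity of their shingle sets is at least `threshold`.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, k)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # too similar to a document we already kept
        kept.append(doc)
        kept_shingles.append(s)
    return kept


policies = [
    "We collect your email address and name to provide our service to you.",
    "We collect your email address and name to provide our service to you!",
    "Cookies are used to remember your preferences across visits.",
]
unique_policies = drop_near_duplicates(policies)
```

This pairwise comparison is quadratic in the number of documents, so it only suits small samples; at the scale of a million policies, an approximate scheme such as MinHash with locality-sensitive hashing would be a more plausible choice.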


  • Privacy_at_Scale.pdf

    size: 316 KB | mime_type: application/pdf | date: 2022-10-05 | sha256: 0e390e5


Work Title: Privacy at scale: Introducing the PrivaSeer corpus of web privacy policies
Open Access
Creators:
  1. Mukund Srinath
  2. Shomir Wilson
  3. C. Lee Giles
License: In Copyright (Rights Reserved)
Work Type: Article
Published In:
  1. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing
Publication Date: August 1, 2021
Publisher Identifier (DOI):
  1. 10.18653/v1/2021.acl-long.532
Deposited: October 05, 2022




