The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider a simple model in which each nucleotide of a sequence S (e.g., a genome or a read) is mutated independently with probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (maximal intervals of mutated k-mers) and oceans (maximal intervals of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results in three applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
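
The mutation model in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: all function names are hypothetical, and it assumes that a k-mer counts as mutated exactly when at least one of its k positions is mutated (the "no spurious matches" assumption). Under that assumption a k-mer survives with probability (1 - r)^k, so by linearity of expectation the expected number of mutated k-mers among L k-mers is L(1 - (1 - r)^k). Inverting that expression gives a simple point estimate of r. The variance results and the confidence intervals derived in the paper are not reproduced here.

```python
import random


def simulate_mutated_kmers(num_kmers, k, r, rng):
    """Simulate the mutation process; return one flag per k-mer.

    A flag is True when the k-mer contains at least one mutated position
    (assumption: no spurious matches, so any mutation changes the k-mer).
    """
    n = num_kmers + k - 1  # number of nucleotides spanned by num_kmers k-mers
    mutated_pos = [rng.random() < r for _ in range(n)]
    # Prefix sums let each k-wide window be checked in constant time.
    prefix = [0]
    for m in mutated_pos:
        prefix.append(prefix[-1] + m)
    return [prefix[i + k] - prefix[i] > 0 for i in range(num_kmers)]


def count_runs(flags, value):
    """Count maximal runs of `value`: islands for True, oceans for False."""
    runs = 0
    prev = None
    for f in flags:
        if f == value and prev != value:
            runs += 1
        prev = f
    return runs


def expected_mutated(num_kmers, k, r):
    """Expected number of mutated k-mers: L * (1 - (1 - r)^k)."""
    return num_kmers * (1 - (1 - r) ** k)


def point_estimate_r(n_mutated, num_kmers, k):
    """Invert the expectation to estimate r from an observed count."""
    return 1 - (1 - n_mutated / num_kmers) ** (1 / k)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
    flags = simulate_mutated_kmers(num_kmers=10_000, k=21, r=0.01, rng=rng)
    n_mut = sum(flags)
    print("mutated k-mers:", n_mut)
    print("expected:", expected_mutated(10_000, 21, 0.01))
    print("islands:", count_runs(flags, True))
    print("oceans:", count_runs(flags, False))
    print("estimated r:", point_estimate_r(n_mut, 10_000, 21))
```

Because neighboring k-mers overlap in k - 1 positions, their mutation flags are strongly dependent, which is why the count of mutated k-mers is not binomial and its variance requires the separate treatment the abstract describes.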

Final publication is available from Mary Ann Liebert, Inc., publishers



Work Title: The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches
Access: Open Access
Creators:
  1. Antonio Blanca
  2. Robert S. Harris
  3. David Koslicki
  4. Paul Medvedev
License: In Copyright (Rights Reserved)
Work Type: Article
Journal: Journal of Computational Biology
Publication Date: February 16, 2022
Publisher Identifier (DOI):
Deposited: March 11, 2024




