The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider a simple model in which a sequence S (e.g., a genome or a read) undergoes a mutation process where each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (maximal intervals of mutated k-mers) and oceans (maximal intervals of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
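
The mutation model in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: all function names are hypothetical, and it assumes that "no spurious matches" means a k-mer counts as mutated exactly when at least one of its k nucleotides was mutated. Under that assumption, independence of the per-nucleotide mutations gives each k-mer a probability 1 - (1 - r)^k of being mutated, so a sequence with L k-mers has L(1 - (1 - r)^k) mutated k-mers in expectation. The sketch also counts islands and oceans as maximal runs of mutated and nonmutated k-mers.

```python
import random


def simulate_kmer_mutations(num_kmers, k, r, rng):
    """Return one boolean per k-mer: True if the k-mer is mutated.

    Assumption (hypothetical reading of the model): a k-mer is mutated
    exactly when at least one of its k nucleotides is mutated.
    """
    num_nucleotides = num_kmers + k - 1
    # Each nucleotide is mutated independently with probability r.
    mutated_nt = [rng.random() < r for _ in range(num_nucleotides)]

    # Sliding window holding the number of mutated nucleotides in the
    # current k-mer, i.e. positions i .. i + k - 1.
    window = sum(mutated_nt[:k])
    flags = []
    for i in range(num_kmers):
        flags.append(window > 0)
        if i + k < num_nucleotides:
            window += mutated_nt[i + k] - mutated_nt[i]
    return flags


def count_runs(flags, value):
    """Count maximal runs of `value` in `flags`.

    value=True counts islands (runs of mutated k-mers);
    value=False counts oceans (runs of nonmutated k-mers).
    """
    runs = 0
    previous = None
    for flag in flags:
        if flag == value and previous != value:
            runs += 1
        previous = flag
    return runs


def expected_mutated_kmers(num_kmers, k, r):
    """Expected number of mutated k-mers: L * (1 - (1 - r)^k)."""
    return num_kmers * (1.0 - (1.0 - r) ** k)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    L, k, r = 10_000, 21, 0.01  # illustrative values, not from the source
    flags = simulate_kmer_mutations(L, k, r, rng)
    print("observed mutated k-mers:", sum(flags))
    print("expected mutated k-mers:", expected_mutated_kmers(L, k, r))
    print("islands:", count_runs(flags, True))
    print("oceans:", count_runs(flags, False))
```

Because neighbouring k-mers share nucleotides, the per-k-mer indicators are dependent, so a single run can deviate from the expectation by more than a binomial model with the same mean would suggest.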

The final publication is available from Mary Ann Liebert, Inc., publishers, at https://dx.doi.org/10.1089/cmb.2021.0431

Metadata

Work Title: The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches
Access: Open Access
Creators:
  1. Antonio Blanca
  2. Robert S. Harris
  3. David Koslicki
  4. Paul Medvedev
License: In Copyright (Rights Reserved)
Work Type: Article
Publisher: Journal of Computational Biology
Publication Date: February 16, 2022
Publisher Identifier (DOI): https://doi.org/10.1089/cmb.2021.0431
Deposited: March 11, 2024

Work History

Version 1
published

  • Created
  • Added 2021.01.15.426881v2.full-1.pdf
  • Added Creator Antonio Blanca
  • Added Creator A Blanca Pimentel
  • Added Creator Robert Scott Harris
  • Added Creator David Koslicki
  • Added Creator Paul Medvedev
  • Published
  • Deleted Creator A Blanca Pimentel
  • Renamed Creator Robert Scott Harris to Robert S. Harris
  • Updated Creator David Koslicki
  • Updated Creator Paul Medvedev
  • Updated