CMash: Fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

Motivation: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient.

Results: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure.

Availability and implementation: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles.

This is a pre-copyedited, author-produced PDF of an article accepted for publication in Bioinformatics following peer review. The version of record [CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices. Bioinformatics 38, Supplement_1 pi28-i35 (2022)] is available online at: https://doi.org/10.1093/bioinformatics/btac237.

Files

Metadata

Work Title CMash: Fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
Access
Open Access
Creators
  1. Shaopeng Liu
  2. David Koslicki
Keyword
  1. MinHash
  2. K-mer
  3. Truncation
  4. Jaccard index
  5. Containment index
License In Copyright (Rights Reserved)
Work Type Article
Publisher
  1. Bioinformatics
Publication Date June 27, 2022
Publisher Identifier (DOI)
  1. https://doi.org/10.1093/bioinformatics/btac237
Deposited February 03, 2024

Versions

Analytics

Collections

This resource is currently not in any collection.

Work History

Version 1
published

  • Created
  • Added 2021.12.06.471436v2.full-1.pdf
  • Added Creator Shaopeng Liu
  • Added Creator David Koslicki
  • Published
  • Updated Keyword, Description, Publication Date Show Changes
    Keyword
    • MinHash, K-mer, Truncation, Jaccard index, Containment index
    Description
    • Motivation: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. Results: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure.
    • Motivation: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient.
    • Results: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure.
    • Availability and implementation: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles.
    Publication Date
    • 2022-06-22
    • 2022-06-27
  • Updated