Prokrustean Graph: A Substring Index for Rapid K-Mer Size Analysis

The widespread use of k-mers in bioinformatics has enabled efficient genomic sequence analysis across various tasks. However, the impact of k-mer size remains unclear, as complex bioinformatics pipelines obscure its influence with noise. The choice of k is often arbitrary and rarely justified in literature or tutorials. Moreover, methods using multiple k-mer sizes face substantial computational challenges. Most methods rely on well-defined k-mer objects like de Bruijn graphs, Jaccard similarity, Bray-Curtis dissimilarity, and k-mer spectra. The role of k-mer size in these objects is intuitive and described by various metrics. Exploring these objects across k-mer sizes enables robust analyses and new applications. However, how k-mer objects evolve with size remains elusive, and modern substring indices struggle to explore them across sizes. We introduce the Prokrustean graph, a novel substring index that tracks k-mer set transformations across sizes. Built on this index, our framework efficiently computes k-mer-based quantities for all sizes, with complexity independent of the range and dependent only on maximal repeats. For example, counting maximal simple paths in de Bruijn graphs for k=1,⋯,100 takes seconds on a gigabase-scale dataset. We demonstrate this with experiments in pangenomics and metagenomics. The Prokrustean graph is space-efficiently constructed from the Burrows-Wheeler Transform. Our implementation is available at: https://github.com/KoslickiLab/prokrustean. The full version of this manuscript, including construction and four application algorithms, is available at: https://doi.org/10.1101/2023.11.21.568151.

Files

Metadata

Work Title Prokrustean Graph: A Substring Index for Rapid K-Mer Size Analysis
Access
Open Access
Creators
  1. Adam Park
  2. David Koslicki
Keyword
  1. k-mer
  2. k-mer range
  3. variable order k-mers
  4. multi-k analysis
  5. data structure
  6. substring index
  7. BWT
  8. Metagenomics
  9. Pangenomics
  10. genome assembly
  11. sequence analysis
  12. de Bruijn graph
  13. overlap graph
License In Copyright (Rights Reserved)
Work Type Article
Publisher
  1. RECOMB 2025 (Research in Computational Molecular Biology)
Publication Date April 25, 2025
Publisher Identifier (DOI)
  1. https://doi.org/10.1007/978-3-031-90252-9_14
Deposited June 11, 2025

Versions

Analytics

Collections

This resource is currently not in any collection.

Work History

Version 1
published

  • Created
  • Added Prokrustean_RECOMB_2025_Final.pdf
  • Added Creator Adam Park
  • Added Creator David Koslicki
  • Published
  • Updated
  • Updated Keyword, Publisher Show Changes
    Keyword
    • k-mer, k-mer range, variable order k-mers, multi-k analysis, data structure, substring index, BWT, Metagenomics, Pangenomics, genome assembly, sequence analysis, de Bruijn graph, overlap graph
    Publisher
    • RECOMB 2025
    • RECOMB 2025 (Research in Computational Molecular Biology)