Prokrustean Graph: A Substring Index for Rapid K-Mer Size Analysis
The widespread use of k-mers in bioinformatics has enabled efficient genomic sequence analysis across various tasks. However, the impact of k-mer size remains unclear, as complex bioinformatics pipelines obscure its influence with noise. The choice of k is often arbitrary and rarely justified in literature or tutorials. Moreover, methods using multiple k-mer sizes face substantial computational challenges. Most methods rely on well-defined k-mer objects like de Bruijn graphs, Jaccard similarity, Bray-Curtis dissimilarity, and k-mer spectra. The role of k-mer size in these objects is intuitive and described by various metrics. Exploring these objects across k-mer sizes enables robust analyses and new applications. However, how k-mer objects evolve with size remains elusive, and modern substring indices struggle to explore them across sizes. We introduce the Prokrustean graph, a novel substring index that tracks k-mer set transformations across sizes. Built on this index, our framework efficiently computes k-mer-based quantities for all sizes, with complexity independent of the range and dependent only on maximal repeats. For example, counting maximal simple paths in de Bruijn graphs for k=1,⋯,100 takes seconds on a gigabase-scale dataset. We demonstrate this with experiments in pangenomics and metagenomics. The Prokrustean graph is space-efficiently constructed from the Burrows-Wheeler Transform. Our implementation is available at: https://github.com/KoslickiLab/prokrustean. The full version of this manuscript, including construction and four application algorithms, is available at: https://doi.org/10.1101/2023.11.21.568151.
Files
Metadata
Work Title | Prokrustean Graph: A Substring Index for Rapid K-Mer Size Analysis |
---|---|
Access | |
Creators |
|
Keyword |
|
License | In Copyright (Rights Reserved) |
Work Type | Article |
Publisher |
|
Publication Date | April 25, 2025 |
Publisher Identifier (DOI) |
|
Deposited | June 11, 2025 |
Versions
Analytics
Collections
This resource is currently not in any collection.