Tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces

The traditional approach for performance debugging relies upon performance profilers (e.g., gprof, VTune) that provide average function runtime information. These aggregate statistics help identify slow regions affecting the entire workload, but they are ill-suited for identifying slow regions that only impact a fraction of the workload, such as tail latency effects. This paper takes a new approach to performance profiling by utilizing distributed tracing systems (e.g., Dapper, Zipkin, Jaeger). Since traces provide detailed timing information on a per-request basis, it is possible to group and aggregate tracing data in many different ways to identify the slow parts of the system. Our new approach to trace aggregation uses the structure embedded within traces to hierarchically group similar traces and calculate increasingly detailed aggregate statistics based on how the traces are grouped. We also develop an automated tool for analyzing the hierarchy of statistics to identify the most likely performance issues. Our case study across two complex distributed systems illustrates how our tool is able to find multiple performance issues that lead to 10x and 28x performance improvements in terms of average and tail latency, respectively. Our comparison with a state-of-the-art industry tool shows that our tool can pinpoint performance slowdowns more accurately than current approaches.
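
The hierarchical grouping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the trace format (a list of span name and duration pairs), the two grouping levels, and all function names are hypothetical, chosen only to show how finer grouping yields more detailed aggregate statistics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace format: each trace is an ordered list of
# (span_name, duration_ms) pairs; the first span is the entry point.
traces = [
    [("frontend", 10.0), ("db", 5.0)],
    [("frontend", 12.0), ("db", 7.0)],
    [("frontend", 11.0), ("cache", 1.0)],
]

def group_by_entry_span(traces):
    """Coarse level (assumed): group traces by the name of their first span."""
    groups = defaultdict(list)
    for trace in traces:
        groups[trace[0][0]].append(trace)
    return dict(groups)

def group_by_structure(traces):
    """Finer level (assumed): group traces by their ordered sequence of span names."""
    groups = defaultdict(list)
    for trace in traces:
        groups[tuple(name for name, _ in trace)].append(trace)
    return dict(groups)

def mean_span_durations(group):
    """Average duration of each span name across the traces in one group."""
    durations = defaultdict(list)
    for trace in group:
        for name, duration in trace:
            durations[name].append(duration)
    return {name: mean(values) for name, values in durations.items()}

coarse = group_by_entry_span(traces)
fine = group_by_structure(traces)
db_stats = mean_span_durations(fine[("frontend", "db")])
```

In this toy data the coarse level produces one group of three traces, while the finer level separates the database path from the cache path, so per-span averages can be compared between structurally similar requests only.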

Metadata

Work Title Tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces
Access
Open Access
Creators
  1. Lexiang Huang
  2. Timothy Zhu
Keyword
  1. Performance debugging
  2. Distributed systems tracing
License In Copyright (Rights Reserved)
Work Type Conference Proceeding
Publication Date November 1, 2021
Publisher Identifier (DOI)
  1. https://doi.org/10.1145/3472883.3486994
Source
  1. In ACM Symposium on Cloud Computing (SoCC’21), November 1–4, 2021, Seattle, WA, USA
Deposited August 03, 2023

Work History

Version 1
published

  • Created
  • Added socc2021-final204.pdf
  • Added Creator Lexiang Huang
  • Added Creator Timothy Zhu
  • Published
  • Updated Source, Keyword, Subtitle
    Source
    • In ACM Symposium on Cloud Computing (SoCC’21), November 1–4, 2021, Seattle, WA, USA
    Keyword
    • Performance debugging, Distributed systems tracing
    Subtitle
    • Performance profiling via structural aggregation and automated analysis of distributed systems traces
  • Updated