Benchmarking — Measuring What Matters 🟡

What you'll learn:

Why naive timing with Instant::now() produces unreliable results

Statistical benchmarking with Criterion.rs and the lighter Divan alternative

Profiling hot spots with perf, flamegraphs, and PGO

Setting up continuous benchmarking in CI to catch regressions automatically

Cross-references: Release Profiles — once you find the hot spot, optimize the binary · CI/CD Pipeline — benchmark job in the pipeline · Code Coverage — coverage tells you what's tested, benchmarks tell you what's fast

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." — Donald Knuth

The hard part isn't writing benchmarks — it's writing benchmarks that produce meaningful, reproducible, actionable numbers. This chapter covers the tools and techniques that get you from "it seems fast" to "we have statistical evidence that PR #347 regressed parsing throughput by 4.2%."

Why Not `std::time::Instant`?

The temptation:

// ❌ Naive benchmarking — unreliable results
use std::time::Instant;

fn main() {
    let start = Instant::now();
    let result = parse_device_query_output(&sample_data);
    let elapsed = start.elapsed();
    println!("Parsing took {:?}", elapsed);
    // Problem 1: Compiler may optimize away `result` (dead code elimination)
    // Problem 2: Single sample — no statistical significance
    // Problem 3: CPU frequency scaling, thermal throttling, other processes
    // Problem 4: Cold cache vs warm cache not controlled
}

Problems with manual timing:

Dead code elimination — the compiler may skip the computation entirely if the result isn't used.
No warm-up — the first run includes cache misses, JIT effects (irrelevant in Rust, but OS page faults apply), and lazy initialization.
No statistical analysis — a single measurement tells you nothing about variance, outliers, or confidence intervals.
No regression detection — you can't compare against previous runs.

Criterion.rs — Statistical Benchmarking

Criterion.rs is the de facto standard for Rust micro-benchmarks. It uses statistical methods to produce reliable measurements and detects performance regressions automatically.

Setup:

# Cargo.toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports", "cargo_bench_support"] }

[[bench]]
name = "parsing_bench"
harness = false  # Use Criterion's harness, not the built-in test harness

A complete benchmark:

// benches/parsing_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};

/// Data type for parsed GPU information
#[derive(Debug, Clone)]
struct GpuInfo {
    index: u32,
    name: String,
    temp_c: u32,
    power_w: f64,
}

/// The function under test — simulate parsing device-query CSV output
fn parse_gpu_csv(input: &str) -> Vec<GpuInfo> {
    input
        .lines()
        .filter(|line| !line.starts_with('#'))
        .filter_map(|line| {
            let fields: Vec<&str> = line.split(", ").collect();
            if fields.len() >= 4 {
                Some(GpuInfo {
                    index: fields[0].parse().ok()?,
                    name: fields[1].to_string(),
                    temp_c: fields[2].parse().ok()?,
                    power_w: fields[3].parse().ok()?,
                })
            } else {
                None
            }
        })
        .collect()
}

fn bench_parse_gpu_csv(c: &mut Criterion) {
    // Representative test data
    let small_input = "0, Acme Accel-V1-80GB, 32, 65.5\n\
                       1, Acme Accel-V1-80GB, 34, 67.2\n";

    let large_input = (0..64)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect::<String>();

    c.bench_function("parse_2_gpus", |b| {
        b.iter(|| parse_gpu_csv(black_box(small_input)))
    });

    c.bench_function("parse_64_gpus", |b| {
        b.iter(|| parse_gpu_csv(black_box(&large_input)))
    });
}

criterion_group!(benches, bench_parse_gpu_csv);
criterion_main!(benches);

Running and reading results:

# Run all benchmarks
cargo bench

# Run a specific benchmark by name
cargo bench -- parse_64

# Output:
# parse_2_gpus        time:   [1.2345 µs  1.2456 µs  1.2578 µs]
#                      ▲            ▲           ▲
#                      │       confidence interval
#                   lower 95%    median    upper 95%
#
# parse_64_gpus       time:   [38.123 µs  38.456 µs  38.812 µs]
#                     change: [-1.2345% -0.5678% +0.1234%] (p = 0.12 > 0.05)
#                     No change in performance detected.

What black_box() does: It's a compiler hint that prevents dead-code elimination and over-aggressive constant folding. The compiler cannot see through black_box, so it must actually compute the result.

Parameterized Benchmarks and Benchmark Groups

Compare multiple implementations or input sizes:

// benches/comparison_bench.rs
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId, Throughput};

fn bench_parsing_strategies(c: &mut Criterion) {
    let mut group = c.benchmark_group("csv_parsing");

    // Test across different input sizes
    for num_gpus in [1, 8, 32, 64, 128] {
        let input = generate_gpu_csv(num_gpus);

        // Set throughput for bytes-per-second reporting
        group.throughput(Throughput::Bytes(input.len() as u64));

        group.bench_with_input(
            BenchmarkId::new("split_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_split(input)),
        );

        group.bench_with_input(
            BenchmarkId::new("regex_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_regex(input)),
        );

        group.bench_with_input(
            BenchmarkId::new("nom_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_nom(input)),
        );
    }
    group.finish();
}

criterion_group!(benches, bench_parsing_strategies);
criterion_main!(benches);

Output: Criterion generates an HTML report at target/criterion/report/index.html with violin plots, comparison charts, and regression analysis — open in a browser.

Divan — A Lighter Alternative

Divan is a newer benchmarking framework that uses attribute macros instead of Criterion's macro DSL:

# Cargo.toml
[dev-dependencies]
divan = "0.1"

[[bench]]
name = "parsing_bench"
harness = false

// benches/parsing_bench.rs
use divan::black_box;

const SMALL_INPUT: &str = "0, Acme Accel-V1-80GB, 32, 65.5\n\
                          1, Acme Accel-V1-80GB, 34, 67.2\n";

fn generate_gpu_csv(n: usize) -> String {
    (0..n)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect()
}

fn main() {
    divan::main();
}

#[divan::bench]
fn parse_2_gpus() -> Vec<GpuInfo> {
    parse_gpu_csv(black_box(SMALL_INPUT))
}

#[divan::bench(args = [1, 8, 32, 64, 128])]
fn parse_n_gpus(n: usize) -> Vec<GpuInfo> {
    let input = generate_gpu_csv(n);
    parse_gpu_csv(black_box(&input))
}

// Divan output is a clean table:
// ╰─ parse_2_gpus   fastest  │ slowest  │ median   │ mean     │ samples │ iters
//                   1.234 µs │ 1.567 µs │ 1.345 µs │ 1.350 µs │ 100     │ 1600

When to choose Divan over Criterion:

Simpler API (attribute macros, less boilerplate)
Faster compilation (fewer dependencies)
Good for quick perf checks during development

When to choose Criterion:

Statistical regression detection across runs
HTML reports with charts
Established ecosystem, more CI integrations

Profiling with `perf` and Flamegraphs

Benchmarks tell you how fast — profiling tells you where the time goes.

# Step 1: Build with debug info (release speed, debug symbols)
cargo build --release
# Ensure debug info is available:
# [profile.release]
# debug = true          # Add this temporarily for profiling

# Step 2: Record with perf
perf record --call-graph=dwarf ./target/release/diag_tool --run-diagnostics

# Step 3: Generate a flamegraph
# Install: cargo install flamegraph
# Install: cargo install addr2line --features=bin (optional, speedup cargo-flamegraph)
cargo flamegraph --root -- --run-diagnostics
# Opens an interactive SVG flamegraph

# Alternative: use perf + inferno
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg

Reading a flamegraph:

Width = time spent in that function (wider = slower)
Height = call stack depth (taller ≠ slower, just deeper)
Bottom = entry point, Top = leaf functions doing actual work
Look for wide plateaus at the top — those are your hot spots

Profile-Guided Optimization (PGO)

Profile-Guided Optimization (PGO) is a compiler optimization technique for improving performance of CPU-intensive applications. The basic concept of PGO is to collect data about the typical execution of a program (e.g. which branches it is likely to take) and then use this data to inform optimizations such as inlining, machine-code layout, register allocation, etc.

There are different ways of collecting data about a program’s execution. One is to run the program inside a profiler (such as perf) and another is to create an instrumented binary, that is, a binary that has data collection built into it, and run that. The latter usually provides more accurate data and it is also what is supported by Rustc.

Below there is an example of instrumentation-based PGO:

# Step 1: Build with instrumentation
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# Step 2: Run representative workloads
./target/release/diag_tool --run-full   # generates profiling data

# Step 3: Merge profiling data
# Use the llvm-profdata that matches rustc's LLVM version:
# $(rustc --print sysroot)/lib/rustlib/x86_64-unknown-linux-gnu/bin/llvm-profdata
# Or if llvm-tools is installed: rustup component add llvm-tools
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data/

# Step 4: Rebuild with profiling feedback
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
# Typical improvement: 5-20% for compute-bound code (parsing, crypto, codegen).
# I/O-bound or syscall-heavy code (like a large project) will see much less benefit
# because the CPU is mostly waiting, not executing hot loops.

As an alternative to directly using the compiler for PGO, you may choose to go with cargo-pgo, which has an intuitive command-line API and saves you the trouble of doing all the manual work.

With cargo-pgo, the optimization workflow from above can look like that:

# Step 1: Build with instrumentation
cargo pgo build

# Step 2: Run representative workloads
cargo pgo run -- --run-full

# Step 3: Rebuild with profiling feedback
cargo pgo optimize

Sampling PGO or SPGO is a more complicated way to perform PGO in a price of reduced runtime overhead compared to instrumentation-based PGO. For now, the best place to read about it is the Clang PGO manual.

Tip: Before spending time on PGO, ensure your release profile already has LTO enabled — it typically delivers a bigger win for less effort.

`hyperfine` — Quick End-to-End Timing

hyperfine benchmarks entire commands, not individual functions. It's perfect for measuring overall binary performance:

# Install
cargo install hyperfine
# Or: sudo apt install hyperfine  (Ubuntu 23.04+)

# Basic benchmark
hyperfine './target/release/diag_tool --run-diagnostics'

# Compare two implementations
hyperfine './target/release/diag_tool_v1 --run-diagnostics' \
          './target/release/diag_tool_v2 --run-diagnostics'

# Warm-up runs + minimum iterations
hyperfine --warmup 3 --min-runs 10 './target/release/diag_tool --run-all'

# Export results as JSON for CI comparison
hyperfine --export-json bench.json './target/release/diag_tool --run-all'

When to use hyperfine vs Criterion:

hyperfine: whole-binary timing, comparing before/after a refactor, I/O-bound workloads
Criterion: micro-benchmarks of individual functions, statistical regression detection

Continuous Benchmarking in CI

Detect performance regressions before they ship:

# .github/workflows/bench.yml
name: Benchmarks

on:
  pull_request:
    paths: ['**/*.rs', 'Cargo.toml', 'Cargo.lock']

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: dtolnay/rust-toolchain@stable

      - name: Run benchmarks
        # Requires criterion = { features = ["cargo_bench_support"] } for --output-format
        run: cargo bench -- --output-format bencher | tee bench_output.txt

      - name: Store benchmark result
        uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'cargo'
          output-file-path: bench_output.txt
          github-token: ${{ secrets.GITHUB_TOKEN }}
          auto-push: true
          alert-threshold: '120%'    # Alert if 20% slower
          comment-on-alert: true
          fail-on-alert: true        # Block PR if regression detected

Key CI considerations:

Use dedicated benchmark runners (not shared CI) for consistent results
Pin the runner to a specific machine type if using cloud CI
Store historical data to detect gradual regressions
Set thresholds based on your workload's tolerance (5% for hot paths, 20% for cold)

Application: Parsing Performance

The project has several performance-sensitive parsing paths that would benefit from benchmarks:

Parsing Hot Spot	Crate	Why It Matters
accelerator-query CSV/XML output	`device_diag`	Called per-GPU, up to 8× per run
Sensor event parsing	`event_log`	Thousands of records on busy servers
PCIe topology JSON	`topology_lib`	Complex nested structures, golden-file validated
Report JSON serialization	`diag_framework`	Final report output, size-sensitive
Config JSON loading	`config_loader`	Startup latency

Recommended first benchmark — the topology parser, which already has golden-file test data:

// topology_lib/benches/parse_bench.rs (proposed)
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use std::fs;

fn bench_topology_parse(c: &mut Criterion) {
    let mut group = c.benchmark_group("topology_parse");

    for golden_file in ["S2001", "S1015", "S1035", "S1080"] {
        let path = format!("tests/test_data/{golden_file}.json");
        let data = fs::read_to_string(&path).expect("golden file not found");
        group.throughput(Throughput::Bytes(data.len() as u64));

        group.bench_function(golden_file, |b| {
            b.iter(|| {
                topology_lib::TopologyProfile::from_json_str(
                    criterion::black_box(&data)
                )
            });
        });
    }
    group.finish();
}

criterion_group!(benches, bench_topology_parse);
criterion_main!(benches);

Try It Yourself

Write a Criterion benchmark: Pick any parsing function in your codebase. Create a benches/ directory, set up a Criterion benchmark that measures throughput in bytes/second. Run cargo bench and examine the HTML report.
Generate a flamegraph: Build your project with debug = true in [profile.release], then run cargo flamegraph -- <your-args>. Identify the three widest stacks at the top of the flamegraph — those are your hot spots.
Compare with hyperfine: Install hyperfine and benchmark the overall execution time of your binary with different flags. Compare it to the per-function times from Criterion. Where does the time go that Criterion doesn't see? (Answer: I/O, syscalls, process startup.)

Benchmark Tool Selection

Loading diagram...

🏋️ Exercises

🟢 Exercise 1: First Criterion Benchmark

Create a crate with a function that sorts a Vec<u64> of 10,000 random elements. Write a Criterion benchmark for it, then switch to .sort_unstable() and observe the performance difference in the HTML report.

Solution

# Cargo.toml
[[bench]]
name = "sort_bench"
harness = false

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
rand = "0.8"

// benches/sort_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::Rng;

fn generate_data(n: usize) -> Vec<u64> {
    let mut rng = rand::thread_rng();
    (0..n).map(|_| rng.gen()).collect()
}

fn bench_sort(c: &mut Criterion) {
    let mut group = c.benchmark_group("sort-10k");

    group.bench_function("stable", |b| {
        b.iter_batched(
            || generate_data(10_000),
            |mut data| { data.sort(); black_box(&data); },
            criterion::BatchSize::SmallInput,
        )
    });

    group.bench_function("unstable", |b| {
        b.iter_batched(
            || generate_data(10_000),
            |mut data| { data.sort_unstable(); black_box(&data); },
            criterion::BatchSize::SmallInput,
        )
    });

    group.finish();
}

criterion_group!(benches, bench_sort);
criterion_main!(benches);

cargo bench
open target/criterion/sort-10k/report/index.html

🟡 Exercise 2: Flamegraph Hot Spot

Build a project with debug = true in [profile.release], then generate a flamegraph. Identify the top 3 widest stacks.

Solution

# Cargo.toml
[profile.release]
debug = true  # Keep symbols for flamegraph

cargo install flamegraph
cargo flamegraph --release -- <your-args>
# Opens flamegraph.svg in browser
# The widest stacks at the top are your hot spots

Key Takeaways

Never benchmark with Instant::now() — use Criterion.rs for statistical rigor and regression detection
black_box() prevents the compiler from optimizing away your benchmark target
hyperfine measures wall-clock time for the whole binary; Criterion measures individual functions — use both
Flamegraphs show where time is spent; benchmarks show how much time is spent
Continuous benchmarking in CI catches performance regressions before they ship

Benchmarking — Measuring What Matters 🟡

What you'll learn:

Why naive timing with Instant::now() produces unreliable results

Statistical benchmarking with Criterion.rs and the lighter Divan alternative

Profiling hot spots with perf, flamegraphs, and PGO

Setting up continuous benchmarking in CI to catch regressions automatically

Cross-references: Release Profiles — once you find the hot spot, optimize the binary · CI/CD Pipeline — benchmark job in the pipeline · Code Coverage — coverage tells you what's tested, benchmarks tell you what's fast

Why Not `std::time::Instant`?

The temptation:

// ❌ Naive benchmarking — unreliable results
use std::time::Instant;

fn main() {
    let start = Instant::now();
    let result = parse_device_query_output(&sample_data);
    let elapsed = start.elapsed();
    println!("Parsing took {:?}", elapsed);
    // Problem 1: Compiler may optimize away `result` (dead code elimination)
    // Problem 2: Single sample — no statistical significance
    // Problem 3: CPU frequency scaling, thermal throttling, other processes
    // Problem 4: Cold cache vs warm cache not controlled
}

Problems with manual timing:

Dead code elimination — the compiler may skip the computation entirely if the result isn't used.
No warm-up — the first run includes cache misses, JIT effects (irrelevant in Rust, but OS page faults apply), and lazy initialization.
No statistical analysis — a single measurement tells you nothing about variance, outliers, or confidence intervals.
No regression detection — you can't compare against previous runs.

Criterion.rs — Statistical Benchmarking

Criterion.rs is the de facto standard for Rust micro-benchmarks. It uses statistical methods to produce reliable measurements and detects performance regressions automatically.

Setup:

# Cargo.toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports", "cargo_bench_support"] }

[[bench]]
name = "parsing_bench"
harness = false  # Use Criterion's harness, not the built-in test harness

A complete benchmark:

// benches/parsing_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};

/// Data type for parsed GPU information
#[derive(Debug, Clone)]
struct GpuInfo {
    index: u32,
    name: String,
    temp_c: u32,
    power_w: f64,
}

/// The function under test — simulate parsing device-query CSV output
fn parse_gpu_csv(input: &str) -> Vec<GpuInfo> {
    input
        .lines()
        .filter(|line| !line.starts_with('#'))
        .filter_map(|line| {
            let fields: Vec<&str> = line.split(", ").collect();
            if fields.len() >= 4 {
                Some(GpuInfo {
                    index: fields[0].parse().ok()?,
                    name: fields[1].to_string(),
                    temp_c: fields[2].parse().ok()?,
                    power_w: fields[3].parse().ok()?,
                })
            } else {
                None
            }
        })
        .collect()
}

fn bench_parse_gpu_csv(c: &mut Criterion) {
    // Representative test data
    let small_input = "0, Acme Accel-V1-80GB, 32, 65.5\n\
                       1, Acme Accel-V1-80GB, 34, 67.2\n";

    let large_input = (0..64)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect::<String>();

    c.bench_function("parse_2_gpus", |b| {
        b.iter(|| parse_gpu_csv(black_box(small_input)))
    });

    c.bench_function("parse_64_gpus", |b| {
        b.iter(|| parse_gpu_csv(black_box(&large_input)))
    });
}

criterion_group!(benches, bench_parse_gpu_csv);
criterion_main!(benches);

Running and reading results:

# Run all benchmarks
cargo bench

# Run a specific benchmark by name
cargo bench -- parse_64

# Output:
# parse_2_gpus        time:   [1.2345 µs  1.2456 µs  1.2578 µs]
#                      ▲            ▲           ▲
#                      │       confidence interval
#                   lower 95%    median    upper 95%
#
# parse_64_gpus       time:   [38.123 µs  38.456 µs  38.812 µs]
#                     change: [-1.2345% -0.5678% +0.1234%] (p = 0.12 > 0.05)
#                     No change in performance detected.

Parameterized Benchmarks and Benchmark Groups

Compare multiple implementations or input sizes:

// benches/comparison_bench.rs
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId, Throughput};

fn bench_parsing_strategies(c: &mut Criterion) {
    let mut group = c.benchmark_group("csv_parsing");

    // Test across different input sizes
    for num_gpus in [1, 8, 32, 64, 128] {
        let input = generate_gpu_csv(num_gpus);

        // Set throughput for bytes-per-second reporting
        group.throughput(Throughput::Bytes(input.len() as u64));

        group.bench_with_input(
            BenchmarkId::new("split_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_split(input)),
        );

        group.bench_with_input(
            BenchmarkId::new("regex_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_regex(input)),
        );

        group.bench_with_input(
            BenchmarkId::new("nom_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_nom(input)),
        );
    }
    group.finish();
}

criterion_group!(benches, bench_parsing_strategies);
criterion_main!(benches);

Output: Criterion generates an HTML report at target/criterion/report/index.html with violin plots, comparison charts, and regression analysis — open in a browser.

Divan — A Lighter Alternative

Divan is a newer benchmarking framework that uses attribute macros instead of Criterion's macro DSL:

# Cargo.toml
[dev-dependencies]
divan = "0.1"

[[bench]]
name = "parsing_bench"
harness = false

// benches/parsing_bench.rs
use divan::black_box;

const SMALL_INPUT: &str = "0, Acme Accel-V1-80GB, 32, 65.5\n\
                          1, Acme Accel-V1-80GB, 34, 67.2\n";

fn generate_gpu_csv(n: usize) -> String {
    (0..n)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect()
}

fn main() {
    divan::main();
}

#[divan::bench]
fn parse_2_gpus() -> Vec<GpuInfo> {
    parse_gpu_csv(black_box(SMALL_INPUT))
}

#[divan::bench(args = [1, 8, 32, 64, 128])]
fn parse_n_gpus(n: usize) -> Vec<GpuInfo> {
    let input = generate_gpu_csv(n);
    parse_gpu_csv(black_box(&input))
}

// Divan output is a clean table:
// ╰─ parse_2_gpus   fastest  │ slowest  │ median   │ mean     │ samples │ iters
//                   1.234 µs │ 1.567 µs │ 1.345 µs │ 1.350 µs │ 100     │ 1600

When to choose Divan over Criterion:

Simpler API (attribute macros, less boilerplate)
Faster compilation (fewer dependencies)
Good for quick perf checks during development

When to choose Criterion:

Statistical regression detection across runs
HTML reports with charts
Established ecosystem, more CI integrations

Profiling with `perf` and Flamegraphs

Benchmarks tell you how fast — profiling tells you where the time goes.

# Step 1: Build with debug info (release speed, debug symbols)
cargo build --release
# Ensure debug info is available:
# [profile.release]
# debug = true          # Add this temporarily for profiling

# Step 2: Record with perf
perf record --call-graph=dwarf ./target/release/diag_tool --run-diagnostics

# Step 3: Generate a flamegraph
# Install: cargo install flamegraph
# Install: cargo install addr2line --features=bin (optional, speedup cargo-flamegraph)
cargo flamegraph --root -- --run-diagnostics
# Opens an interactive SVG flamegraph

# Alternative: use perf + inferno
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg

Reading a flamegraph:

Width = time spent in that function (wider = slower)
Height = call stack depth (taller ≠ slower, just deeper)
Bottom = entry point, Top = leaf functions doing actual work
Look for wide plateaus at the top — those are your hot spots

Profile-Guided Optimization (PGO)

Below there is an example of instrumentation-based PGO:

# Step 1: Build with instrumentation
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# Step 2: Run representative workloads
./target/release/diag_tool --run-full   # generates profiling data

# Step 3: Merge profiling data
# Use the llvm-profdata that matches rustc's LLVM version:
# $(rustc --print sysroot)/lib/rustlib/x86_64-unknown-linux-gnu/bin/llvm-profdata
# Or if llvm-tools is installed: rustup component add llvm-tools
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data/

# Step 4: Rebuild with profiling feedback
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
# Typical improvement: 5-20% for compute-bound code (parsing, crypto, codegen).
# I/O-bound or syscall-heavy code (like a large project) will see much less benefit
# because the CPU is mostly waiting, not executing hot loops.

As an alternative to directly using the compiler for PGO, you may choose to go with cargo-pgo, which has an intuitive command-line API and saves you the trouble of doing all the manual work.

With cargo-pgo, the optimization workflow from above can look like that:

# Step 1: Build with instrumentation
cargo pgo build

# Step 2: Run representative workloads
cargo pgo run -- --run-full

# Step 3: Rebuild with profiling feedback
cargo pgo optimize

Tip: Before spending time on PGO, ensure your release profile already has LTO enabled — it typically delivers a bigger win for less effort.

`hyperfine` — Quick End-to-End Timing

hyperfine benchmarks entire commands, not individual functions. It's perfect for measuring overall binary performance:

# Install
cargo install hyperfine
# Or: sudo apt install hyperfine  (Ubuntu 23.04+)

# Basic benchmark
hyperfine './target/release/diag_tool --run-diagnostics'

# Compare two implementations
hyperfine './target/release/diag_tool_v1 --run-diagnostics' \
          './target/release/diag_tool_v2 --run-diagnostics'

# Warm-up runs + minimum iterations
hyperfine --warmup 3 --min-runs 10 './target/release/diag_tool --run-all'

# Export results as JSON for CI comparison
hyperfine --export-json bench.json './target/release/diag_tool --run-all'

When to use hyperfine vs Criterion:

hyperfine: whole-binary timing, comparing before/after a refactor, I/O-bound workloads
Criterion: micro-benchmarks of individual functions, statistical regression detection

Continuous Benchmarking in CI

Detect performance regressions before they ship:

# .github/workflows/bench.yml
name: Benchmarks

on:
  pull_request:
    paths: ['**/*.rs', 'Cargo.toml', 'Cargo.lock']

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: dtolnay/rust-toolchain@stable

      - name: Run benchmarks
        # Requires criterion = { features = ["cargo_bench_support"] } for --output-format
        run: cargo bench -- --output-format bencher | tee bench_output.txt

      - name: Store benchmark result
        uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'cargo'
          output-file-path: bench_output.txt
          github-token: ${{ secrets.GITHUB_TOKEN }}
          auto-push: true
          alert-threshold: '120%'    # Alert if 20% slower
          comment-on-alert: true
          fail-on-alert: true        # Block PR if regression detected

Key CI considerations:

Use dedicated benchmark runners (not shared CI) for consistent results
Pin the runner to a specific machine type if using cloud CI
Store historical data to detect gradual regressions
Set thresholds based on your workload's tolerance (5% for hot paths, 20% for cold)

Application: Parsing Performance

The project has several performance-sensitive parsing paths that would benefit from benchmarks:

Parsing Hot Spot	Crate	Why It Matters
accelerator-query CSV/XML output	`device_diag`	Called per-GPU, up to 8× per run
Sensor event parsing	`event_log`	Thousands of records on busy servers
PCIe topology JSON	`topology_lib`	Complex nested structures, golden-file validated
Report JSON serialization	`diag_framework`	Final report output, size-sensitive
Config JSON loading	`config_loader`	Startup latency

Recommended first benchmark — the topology parser, which already has golden-file test data:

// topology_lib/benches/parse_bench.rs (proposed)
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use std::fs;

fn bench_topology_parse(c: &mut Criterion) {
    let mut group = c.benchmark_group("topology_parse");

    for golden_file in ["S2001", "S1015", "S1035", "S1080"] {
        let path = format!("tests/test_data/{golden_file}.json");
        let data = fs::read_to_string(&path).expect("golden file not found");
        group.throughput(Throughput::Bytes(data.len() as u64));

        group.bench_function(golden_file, |b| {
            b.iter(|| {
                topology_lib::TopologyProfile::from_json_str(
                    criterion::black_box(&data)
                )
            });
        });
    }
    group.finish();
}

criterion_group!(benches, bench_topology_parse);
criterion_main!(benches);

Try It Yourself

Write a Criterion benchmark: Pick any parsing function in your codebase. Create a benches/ directory, set up a Criterion benchmark that measures throughput in bytes/second. Run cargo bench and examine the HTML report.
Generate a flamegraph: Build your project with debug = true in [profile.release], then run cargo flamegraph -- <your-args>. Identify the three widest stacks at the top of the flamegraph — those are your hot spots.
Compare with hyperfine: Install hyperfine and benchmark the overall execution time of your binary with different flags. Compare it to the per-function times from Criterion. Where does the time go that Criterion doesn't see? (Answer: I/O, syscalls, process startup.)

# Cargo.toml
[[bench]]
name = "sort_bench"
harness = false

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
rand = "0.8"

// benches/sort_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::Rng;

fn generate_data(n: usize) -> Vec<u64> {
    let mut rng = rand::thread_rng();
    (0..n).map(|_| rng.gen()).collect()
}

fn bench_sort(c: &mut Criterion) {
    let mut group = c.benchmark_group("sort-10k");

    group.bench_function("stable", |b| {
        b.iter_batched(
            || generate_data(10_000),
            |mut data| { data.sort(); black_box(&data); },
            criterion::BatchSize::SmallInput,
        )
    });

    group.bench_function("unstable", |b| {
        b.iter_batched(
            || generate_data(10_000),
            |mut data| { data.sort_unstable(); black_box(&data); },
            criterion::BatchSize::SmallInput,
        )
    });

    group.finish();
}

criterion_group!(benches, bench_sort);
criterion_main!(benches);

cargo bench
open target/criterion/sort-10k/report/index.html

🟡 Exercise 2: Flamegraph Hot Spot

Build a project with debug = true in [profile.release], then generate a flamegraph. Identify the top 3 widest stacks.

Solution

# Cargo.toml
[profile.release]
debug = true  # Keep symbols for flamegraph

cargo install flamegraph
cargo flamegraph --release -- <your-args>
# Opens flamegraph.svg in browser
# The widest stacks at the top are your hot spots

Key Takeaways

Never benchmark with Instant::now() — use Criterion.rs for statistical rigor and regression detection
black_box() prevents the compiler from optimizing away your benchmark target
hyperfine measures wall-clock time for the whole binary; Criterion measures individual functions — use both
Flamegraphs show where time is spent; benchmarks show how much time is spent
Continuous benchmarking in CI catches performance regressions before they ship