Profiling

This chapter records concrete profiling runs on rsconstruct, with methodology and findings pinned to a specific version. Add new runs as new sections with date + version headers so historical data stays intact.

How to profile locally

Build a profile-friendly binary

The default release profile strips symbols, so stack traces come out as raw addresses. Cargo.toml defines a profiling profile that inherits release but keeps full debug info:

[profile.profiling]
inherits = "release"
strip = false
debug = true

Build with:

cargo build --profile profiling
# binary lands in target/profiling/rsconstruct

Prerequisite: relax `perf_event_paranoid`

Kernel sampling (perf, samply) requires kernel.perf_event_paranoid <= 1. On a personal dev machine, persist it:

echo 'kernel.perf_event_paranoid = 1' | sudo tee /etc/sysctl.d/60-perf.conf
sudo sysctl --system

Record with `perf` (text-pipeline-friendly)

On CPUs without LBR (most laptops), DWARF unwinding is very slow to post-process — don’t use --call-graph dwarf unless you’re patient. Without a call graph you still get reliable self-time attribution:

perf record -F 999 -o /tmp/rsc.perf.data -- \
    target/profiling/rsconstruct --quiet --color=never status

perf report -i /tmp/rsc.perf.data --stdio --no-children \
    --sort symbol --percent-limit 0.1

Alternative: `samply` (Firefox-Profiler UI)

cargo install samply
samply record -r 4000 -o /tmp/rsc.json.gz -- \
    target/profiling/rsconstruct --quiet --color=never status

Default behavior opens a local UI. Use --save-only to just write the file.

Hardware counters

perf stat -d -- target/profiling/rsconstruct --quiet --color=never status

Gives IPC, cache miss rates, branch miss rates — useful for “is this CPU-bound, memory-bound, or branch-mispredict-bound.”

Run: 2026-04-12 — rsconstruct 0.8.1 — `status` on `teaching-slides`

Target

Command: rsconstruct --quiet --color=never status
Project: ../teaching-slides (10,027 products across 10 processors).
Product breakdown: explicit (1), ipdfunite (55), markdownlint (824), marp (824), ruff (19), script.check_md (824), script.check_svg (3327), svglint (3327), tera (2), zspell (824).

Methodology

Binary: target/profiling/rsconstruct (release + debug info).
Sampler: perf record -F 999 (no call-graph — LBR unavailable, DWARF too slow to post-process on this host).
Counters: perf stat -d.

Wall-clock and counters

Metric	Value
Wall time	1.08 s
User time	0.99 s
System time	0.08 s
CPU utilization	98.7 % of 1 core
RSS peak	28 MB
Instructions	21.10 B
Cycles	5.30 B
IPC	3.98 (very high)
Frontend stall	12.8 %
Branches	5.11 B
Branch miss rate	0.60 %
L1-dcache loads	7.03 B
L1-dcache miss rate	4.13 %

Interpretation: high IPC, low miss rates, low branch mispredictions. The CPU pipeline is fully utilized — slowness comes from doing too many instructions, not from cache thrash or branch mispredicts.

Hot spots (self-time)

% of CPU	Function
48.79 %	`std::path::Components::parse_next_component_back`
12.90 %	`<std::path::Components as DoubleEndedIterator>::next_back`
10.84 %	`rsconstruct::graph::BuildGraph::add_product_with_variant`
8.43 %	`<std::path::Components as PartialEq>::eq`
1.41 %	`__memcmp_evex_movbe`
1.04 %	`core::str::converts::from_utf8`
0.89 %	`_int_malloc`
0.78 %	`std::fs::DirEntry::file_type`
0.61 %	`<std::path::Path as Hash>::hash`
0.60 %	`<std::path::Components as Iterator>::next`
0.38 %	`std::sys::fs::metadata`
0.38 %	`<sip::Hasher as Hasher>::write`
0.37 %	`sha2::sha256::x86::digest_blocks`
0.34 %	`<core::str::lossy::Utf8Chunks as Iterator>::next`
0.31 %	`_int_realloc`
0.29 %	`_int_free_chunk`
0.19 %	`rsconstruct::graph::Product::cache_key`
0.19 %	`std::path::compare_components`
0.19 %	`serde_json::read::SliceRead::parse_str`
0.19 %	`statx`
0.19 %	`malloc`
0.19 %	`cfree`
0.18 %	`core::hash::BuildHasher::hash_one`
rest	scattered < 0.15 % each

Findings

~70 % of CPU is in PathBuf iteration / comparison. Specifically parse_next_component_back + next_back + Components::eq, all invoked from PathBuf equality and hashing. Filesystem I/O (readdir, stat, open) is under 2 %. Hashing (SHA-256 + SipHash) is under 1 %.

The callsite is BuildGraph::add_product_with_variant in src/graph.rs (lines 221–307). It contains three loops whose path-equality cost dominates the whole run:

Lines 232–242 — checker dedup loop. For every checker product (outputs empty), scans every existing product and compares existing.inputs[0] == inputs[0] (full PathBuf equality, which iterates components). With 7,000+ checker products in teaching-slides (script.check_md + script.check_svg + svglint + markdownlint + zspell), this is an O(P²) pass per processor over the course of discovery.
Lines 252–253 — superset check for generator re-declarations. Includes existing.inputs.iter().all(|i| inputs.contains(i)) — an O(M²) call, again per-insertion, again comparing PathBufs component-by-component.
Lines 246–285 — output conflict check. Fast path (HashMap lookup); not the bottleneck.

Graph mutation itself (add_product_with_variant self-time, 10.84 %) is modest. The quadratic scans inside it are where the time goes — they just happen to be attributed to the stdlib path-iteration functions.

Suggested fix (not yet implemented)

Index the checker-dedup and generator-superset lookups via a HashMap keyed on (processor, primary_input, variant) so the linear scans become O(1). For 10,027 products, the expected improvement is ~3×–5× on status wall time.

Scope: additions to BuildGraph (two new HashMap index fields, kept in sync with add_product_*), a small change to add_product_with_variant to do HashMap lookups instead of linear scans. No cache-layout or on-disk-format changes.

Raw data

/tmp/rsc.perf.data was recorded and analyzed to produce the tables above. Removed afterwards — regenerate via the methodology section if needed.

Run: 2026-04-12 (later) — HEAD after HashMap dedup fix

Wall-clock and counters

Metric	Value	vs. 0.8.1 tag
Wall time	0.265 s	4.1× faster
Instructions	2.05 B	-90 %
Cycles	0.88 B	-83 %
IPC	2.34	was 3.98
L1-dcache miss rate	1.34 %	was 4.13 %

The quadratic path-equality peak is gone. What remains is the normal cost of using PathBuf as HashMap keys.

Hot spots (self-time, user-space, 9,948 samples, 10 iterations)

%	Function	Category
5.42	`core::str::converts::from_utf8`	UTF-8 validation
3.52	`sip::Hasher::write`	HashMap hashing
3.51	`<Path as Hash>::hash`	HashMap hashing
3.39	`sha2::sha256::digest_blocks`	Checksumming
2.09	`Components::next`	Path iteration
2.00	`_int_malloc`	Allocator
1.92	`parse_next_component_back`	Path iteration
1.60	`compare_components`	Path comparison
1.19	`combined_input_checksum`	Checksumming
1.12	`Product::cache_key`	Cache keys

Run: 2026-04-12 (later still) — HEAD after path interning

Context

BuildGraph’s three hot HashMaps (output_to_product, input_to_products, checker_dedup) switched from PathBuf keys to a private PathId(u32) backed by an in-memory PathInterner. See Path Interning for design.

Wall-clock and counters

Metric	Value	vs. previous
Wall time	0.245 s	-8 %
Instructions	2.04 B	~flat
Cycles	0.91 B	~flat

Hot spots (self-time, user-space, 9,925 samples, 10 iterations)

%	Function	Notes
4.34	`core::str::converts::from_utf8`	unchanged
3.21	`sha2::sha256::digest_blocks`	unchanged
2.60	`Components::next`	unchanged
2.39	`sip::Hasher::write`	down from 3.52 %
2.25	`<Path as Hash>::hash`	down from 3.51 %
2.06	`_int_malloc`	unchanged
1.52	`resolve_dependencies`	new — attribution shift
1.19	`compare_components`	down from 1.60 %
1.09	`combined_input_checksum`	unchanged

Interning paid off exactly where predicted — the hashing/compare columns dropped, and resolve_dependencies appears because its inner loop is now small enough to self-attribute rather than vanish inside the stdlib path functions. The total gain is modest (~8 %) because after the HashMap dedup fix, HashMap key cost was only ~7 % of total, and interning cuts that in half.

Candidate next targets (not yet implemented)

UTF-8 validation (~6 %) — from display().to_string() in cache-key building. Cache the string form per product or build keys from raw bytes.
Product::cache_key + hex encoding (~2 % combined) — precompute and memoize per product.
SHA-256 (~3 %) — already hardware-accelerated; the only lever is fewer calls, via memoized input_checksum or better batch reuse.

RSConstruct - Rust Build Tool

Profiling

How to profile locally

Build a profile-friendly binary

Prerequisite: relax `perf_event_paranoid`

Record with `perf` (text-pipeline-friendly)

Alternative: `samply` (Firefox-Profiler UI)

Hardware counters

Run: 2026-04-12 — rsconstruct 0.8.1 — `status` on `teaching-slides`

Target

Methodology

Wall-clock and counters

Hot spots (self-time)

Findings

Suggested fix (not yet implemented)

Raw data

Run: 2026-04-12 (later) — HEAD after HashMap dedup fix

Wall-clock and counters

Hot spots (self-time, user-space, 9,948 samples, 10 iterations)

Run: 2026-04-12 (later still) — HEAD after path interning

Context

Wall-clock and counters

Hot spots (self-time, user-space, 9,925 samples, 10 iterations)

Candidate next targets (not yet implemented)

See also

Keyboard shortcuts

RSConstruct - Rust Build Tool