
How I accidentally slandered Rust with a microbenchmark

I added runnable Rust and TypeScript to my LeetCode writeups last week. The UI is a read-only code block with an "Edit & run" button; click it, the code becomes editable, hit Run, and you see pass/fail per test case plus a timing.

The timings were embarrassing:

Rust   Two Sum (HashMap)
case 1   5,310 ns
case 2     570 ns
case 3     350 ns

TypeScript  Two Sum (HashMap)
case 1   0 ns
case 2   0 ns
case 3   0 ns

Rust in the thousands of nanoseconds; TypeScript at a flat zero. On the same algorithm.

I published this. Briefly. Then I spent an afternoon figuring out which part of the setup was lying.

Round 1 — the timer was lying

First clue: TypeScript's timings were 0 ns on every case. Not slow, not fast — zero.

performance.now() in a Web Worker is clamped to roughly 100μs of resolution in Chromium. The algorithm is ~10 ns. The timer can't see anything that fast. It floors to zero.
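The floor is easy to model as plain arithmetic. A toy sketch (in Rust, for consistency with the harness code later; the 100 µs clamp is the rough figure above, not an exact spec):

```rust
// Toy model of a clamped timer: the true elapsed time gets rounded down
// to a multiple of the timer's resolution.
fn clamped_reading_ns(true_elapsed_ns: u64, resolution_ns: u64) -> u64 {
    (true_elapsed_ns / resolution_ns) * resolution_ns
}

fn main() {
    let res = 100_000; // ~100 µs, in ns

    // One ~10 ns call: the clamped timer reads 0. Every time.
    assert_eq!(clamped_reading_ns(10, res), 0);

    // 1,000,000 calls back to back: ~10 ms of true time, far above the floor.
    let total = clamped_reading_ns(10 * 1_000_000, res);

    // Divide back out and the per-call estimate is recovered.
    assert_eq!(total / 1_000_000, 10);
}
```

The amortized estimate is accurate to within resolution divided by iterations, which is exactly why growing the iteration count works.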

Fix: don't measure a single call; measure many and divide. The TS worker now does this:

function measure(fn: (...args: unknown[]) => unknown, args: unknown[]): number {
  // Warm the JIT so TurboFan compiles the hot path before timing it.
  for (let w = 0; w < 3; w++) fn(...args);

  let iters = 100;
  let perCallNs = 0;
  while (iters <= 1_000_000) {
    const start = performance.now();
    for (let i = 0; i < iters; i++) fn(...args);
    const totalMs = performance.now() - start;
    perCallNs = (totalMs / iters) * 1e6; // ms per call → ns per call
    if (totalMs >= 2) break; // total cleared the noise floor; trust the estimate
    iters *= 10;
  }
  return perCallNs;
}

Exponential growth until the wall-clock total is above the timer's noise floor (2 ms), then divide for ns-per-call. This is the oldest microbenchmark trick in the book and I didn't do it on the first pass.

After this, TypeScript reported 27 / 33 / 22 ns. Real numbers. Meanwhile Rust was still at 5,310 ns on case 1.

Round 2 — Rust was running in debug

The API that posts user code to the Rust Playground had one line that decided everything:

body: JSON.stringify({
  channel: 'stable',
  mode: 'debug',   // ← this
  edition: '2021',
  ...
}),

Debug Rust does not optimize. Every Vec index is a real function call with a bounds check. Solution::two_sum is not inlined. The brute-force inner loop doesn't auto-vectorize. Every one of these is an optimization that opt-level 3 (what --release uses) does for free and that plain cargo build does not.

The fix is trivial — change the string — but it introduces a tradeoff. Debug compiles in ~1 second on the shared playground. Release compiles in 5–15 seconds cold. That's the difference between "Run is a button you click" and "Run is a button you click and then wait for".

I didn't want to force a 10-second wait on every correctness check. So now there are two buttons:

  • Run — debug, ~1s compile, for "does this pass the tests"
  • Benchmark — release, slow, for "how fast is this actually"

The result header labels which build produced the numbers: 3/3 cases passed · debug (muted grey) or · release (green). Hover the tag for a tooltip explaining the tradeoff.

After adding Benchmark, release Rust dropped from 5,310 ns to 1,710 ns on case 1.

Still ~60× TypeScript. We're not done.

Round 3 — the clone was inside the timer

Here's the Rust harness that was running:

let start = Instant::now();
let got = Solution::two_sum(nums.clone(), *target);
let ns = start.elapsed().as_nanos();

Look at nums.clone(). That's inside the timer.

The function signature is fn two_sum(nums: Vec<i32>, target: i32) -> Vec<i32> — it takes nums by value. Rust's ownership model means that if I want to keep nums for the next test case, I have to clone it before passing it in. A Vec<i32> clone is a fresh heap allocation for the backing buffer plus a memcpy into it. For a 4-element vec that's in the ~100–300 ns range even in release.

TypeScript has no equivalent cost. const j = seen.get(need) reaches the Map through a reference, and the input array is passed by reference as well. There's no allocation in the hot loop.

So "Rust's algorithm cost" in my harness was really algorithm + allocator + memcpy, and "TypeScript's algorithm cost" was algorithm. And I was comparing those two numbers as if they measured the same thing.

The fix is to pre-allocate every cloned input outside the timer:

use std::time::Instant;

fn bench_clone<T, F>(input: &T, mut f: F) -> u128
where
    T: Clone,
    F: FnMut(T),
{
    // Warmup: three untimed calls.
    for _ in 0..3 {
        f(input.clone());
    }

    let mut iters: u128 = 100;
    while iters <= 1_000_000 {
        let pool: Vec<T> = (0..iters).map(|_| input.clone()).collect();
        let start = Instant::now();
        for x in pool {
            f(x);
        }
        let total_ns = start.elapsed().as_nanos();
        if total_ns >= 2_000_000 {
            return total_ns / iters;
        }
        iters *= 10;
    }
    0
}

Same shape as the TypeScript worker: 3 warmup calls, exponential iter growth until wall-clock ≥ 2 ms, divide for ns-per-call. The cloning is still there — you can't move a Vec twice in Rust — but it happens up front. The timer only sees the algorithm plus a 24-byte move per iteration.

One more thing that matters here:

black_box(Solution::two_sum(black_box(n), black_box(*target)));

Without black_box, LLVM can see that the return value is unused, prove the function has no observable side effects, and delete the call. The entire loop. You get "0 ns" and think you just wrote the fastest hashmap lookup in history. std::hint::black_box tells the optimizer to pretend the value escapes, so the call stays.

This is the classic Rust benchmark trap. It bites everyone exactly once.
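Putting all three rounds together, here's a self-contained sketch of the harness. The brute-force two_sum is a stand-in for whatever the reader pasted into the editor, and bench_clone is repeated so the snippet compiles on its own:

```rust
use std::hint::black_box;
use std::time::Instant;

// Stand-in for user code: brute-force Two Sum with the by-value
// LeetCode signature.
fn two_sum(nums: Vec<i32>, target: i32) -> Vec<i32> {
    for i in 0..nums.len() {
        for j in (i + 1)..nums.len() {
            if nums[i] + nums[j] == target {
                return vec![i as i32, j as i32];
            }
        }
    }
    vec![]
}

// Same bench_clone as above: clones pre-allocated outside the timer,
// exponential iteration growth, divide for ns per call.
fn bench_clone<T: Clone, F: FnMut(T)>(input: &T, mut f: F) -> u128 {
    for _ in 0..3 {
        f(input.clone()); // warmup
    }
    let mut iters: u128 = 100;
    while iters <= 1_000_000 {
        let pool: Vec<T> = (0..iters).map(|_| input.clone()).collect();
        let start = Instant::now();
        for x in pool {
            f(x);
        }
        let total_ns = start.elapsed().as_nanos();
        if total_ns >= 2_000_000 {
            return total_ns / iters;
        }
        iters *= 10;
    }
    0
}

fn main() {
    let nums = vec![2, 7, 11, 15];
    // black_box on both inputs and output so LLVM can't delete the call.
    let ns = bench_clone(&nums, |n| {
        black_box(two_sum(black_box(n), black_box(9)));
    });
    println!("~{ns} ns/call (clone cost excluded from the timer)");
}
```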

Numbers after round 3:

Rust release  Two Sum (HashMap)
case 1   67 ns
case 2   93 ns
case 3   72 ns

TypeScript
case 1   27 ns
case 2   33 ns
case 3   22 ns

~2–3×. That's a real language difference, not a methodology artifact.

What the remaining gap actually is

Rust's default HashMap uses SipHash, a hash chosen to resist HashDoS (hash-flooding) attacks. V8's Map uses something much cheaper because the engine isn't worried about collision attacks on in-memory maps. If you swap Rust's default for ahash or FxHash, the gap closes or flips the other way.
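Here's the shape of that swap. ahash and FxHash are external crates, so this sketch hand-rolls a toy multiplicative hasher purely to show where a custom hasher plugs into std's HashMap; it is not the real FxHash:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Toy byte-at-a-time multiplicative hasher. Cheap, and nothing like
// collision-resistant; a stand-in for FxHash/ahash, not a replacement.
#[derive(Default)]
struct ToyFxHasher(u64);

impl Hasher for ToyFxHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 ^ b as u64).wrapping_mul(0x100000001b3);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// Same API as HashMap::new(), different (cheaper) hash function.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<ToyFxHasher>>;

fn two_sum_fast(nums: &[i32], target: i32) -> Vec<i32> {
    let mut seen: FastMap<i32, usize> = FastMap::default();
    for (i, &n) in nums.iter().enumerate() {
        if let Some(&j) = seen.get(&(target - n)) {
            return vec![j as i32, i as i32];
        }
        seen.insert(n, i);
    }
    vec![]
}

fn main() {
    assert_eq!(two_sum_fast(&[2, 7, 11, 15], 9), vec![0, 1]);
}
```

A hash this cheap is also trivially floodable, which is exactly the tradeoff std's default refuses to make for you.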

There's also still a 24-byte Vec move per iteration in Rust that TypeScript's reference-passing skips. Not free. Not the algorithm's fault.

On the brute-force approach (no HashMap, pure loops), release Rust runs at 67 ns on case 1 and TypeScript runs at about the same. They're comparable when the workload doesn't route through a hash function.

Which is what you'd expect. It's just that "what you'd expect" was invisible under three layers of methodology noise.

The takeaway I keep forgetting

Every time I see a cross-language microbenchmark that looks shocking — "X is 100× Y!" — the first question is always what's in the timing region? Then: is there amortization? Then: what build mode? Then: is there a warmup?

Methodology often dwarfs the underlying phenomenon for tiny workloads. A 4-element vec in a tight loop has an algorithm cost measured in single-digit nanoseconds. A single heap allocation is 100+ ns. A cold I-cache miss is 50+ ns. A timer-resolution floor is 100,000 ns. Any one of those, uncorrected, produces a headline that has nothing to do with the language.

The live benchmark on my writeups is at /leet/problem/1. Click Rust, hit Benchmark, expand any case. The numbers you see are after all three rounds of this debugging. They may still be wrong in ways I haven't noticed yet.