Tokio Internals — The Runtime You Think You Know
Schedulers, wakers, and why every `await` is a structured decision.
Foundation · Core Engine & Data Structures
Why this week
You've written `tokio::spawn` a thousand times. This week you're going to learn what it actually costs. A spawn is a heap allocation, a push onto a lock-free run queue, and waker setup; it's fast, but it's not free, and in a matching engine handling 50K events/sec you feel every one of those costs as p99 latency.
You'll build two small primitives — a concurrent gather() that joins futures without futures_util, and a Histogram that lets you speak to tail latency instead of averages — and use them to measure the overhead of tokio::spawn against running the same work inline.
The Rust Playground runs on a shared multi-tenant CPU. Latency numbers here are directional, not authoritative — they'll show "spawn has an order-of-magnitude overhead over inline", but for publishable numbers you run criterion on your laptop.
Day 1 — Reading the scheduler
Tokio's worker loop
Learn the source, not the slogans
Read `run_task`, `next_task`, and the work-stealing loop. Every ready task either runs from its worker's local run queue or gets stolen by an idle worker; there is no magic.
The PR explains why the old scheduler bottlenecked on a single global queue. The fix — per-worker local queues + a stealing protocol — is the shape of every modern async runtime.
What invariant does the worker loop maintain between LIFO slot, local queue, and global queue? Why does the order matter for latency vs throughput?
When a future's waker is called, what does it actually *do*? Trace the call: Waker::wake() → ... → eventual `run_task`.
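Before tracing Tokio's version, it helps to see the contract in miniature. The sketch below is a toy, not Tokio's code: `FlagWaker` stands in for the scheduler, and `wake()` just flips a readiness flag instead of pushing the task onto a run queue. The shape of the handshake is the same: a pending future stashes the waker, `wake()` marks the task schedulable, and the executor re-polls.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Toy "scheduler": wake() flips a readiness flag the executor checks
// before re-polling. Tokio's waker instead pushes the task onto a run
// queue, but the contract is identical.
struct FlagWaker(AtomicBool);

impl Wake for FlagWaker {
    fn wake(self: Arc<Self>) {
        self.0.store(true, Ordering::SeqCst);
    }
}

// A future that is Pending once, wakes itself, then completes.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = u32;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        if self.0 {
            Poll::Ready(42)
        } else {
            self.0 = true;
            cx.waker().wake_by_ref(); // "I'm ready again, re-poll me"
            Poll::Pending
        }
    }
}

fn main() {
    let flag = Arc::new(FlagWaker(AtomicBool::new(false)));
    let waker = Waker::from(flag.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = YieldOnce(false);

    assert!(Pin::new(&mut fut).poll(&mut cx).is_pending());
    assert!(flag.0.load(Ordering::SeqCst)); // wake() ran: task is schedulable
    assert_eq!(Pin::new(&mut fut).poll(&mut cx), Poll::Ready(42));
}
```

When you read Tokio's `run_task`, look for exactly these two halves: who stores the waker, and what `wake()` enqueues.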
Day 2 — Blocking and the I/O driver
What is blocking?
Alice Ryhl's framing, plus the reactor
The definition: 'blocking' = holding the worker thread without yielding back to the scheduler. CPU-bound code blocks; synchronous file I/O blocks; even a busy loop between awaits blocks.
Walk through one async handler you've written. Where does the worker block between awaits? What would spawn_blocking vs rayon vs a yielding loop buy you?
Day 3 — Gather without external crates
Compose futures by hand
If you understand Future + Waker, you don't need futures_util
Build a gather that polls N futures concurrently and returns their results in the same order they were submitted. It's what futures::future::join_all does — rewrite it so you understand what's happening.
The simplest working implementation: spawn each future as a tokio task, hold the JoinHandles, await each in order. That does give concurrent execution — but tokio::spawn requires 'static futures. For the drill we'll do it the "real" way: poll in place on the current task, waking on any child readiness.
Implement concurrent gather()
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Poll every future to completion concurrently on the current task.
// Returns results in the same order as the inputs. Exactly like
// futures::future::join_all, without depending on futures_util.
pub async fn gather<F, T>(mut futures: Vec<F>) -> Vec<T>
where
    F: Future<Output = T> + Unpin,
{
    // Suggested approach:
    // - Wrap each future in Some(f), and use std::future::poll_fn to get
    //   access to the outer Context.
    // - On each poll: for every Some slot, poll it with
    //   Pin::new(f).poll(cx). If Ready, record the result and set the
    //   slot to None.
    // - If all slots are None, return Poll::Ready with the results;
    //   otherwise return Poll::Pending. The inner futures registered our
    //   waker, so any child's readiness will re-poll us.
    let _ = &mut futures;
    todo!()
}
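If you want to check your shape after attempting the drill, here is one sketch of the finished loop. It is not Tokio's or futures_util's code; the `block_on` below is a toy thread-park executor included only so the example runs standalone without tokio.

```rust
use std::future::{poll_fn, ready, Future};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Slots hold Some(future) until that future completes; poll_fn hands us
// the outer Context, so every child registers the same waker.
pub async fn gather<F, T>(futures: Vec<F>) -> Vec<T>
where
    F: Future<Output = T> + Unpin,
{
    let mut slots: Vec<Option<F>> = futures.into_iter().map(Some).collect();
    let mut results: Vec<Option<T>> = slots.iter().map(|_| None).collect();
    poll_fn(move |cx| {
        let mut all_done = true;
        for (i, slot) in slots.iter_mut().enumerate() {
            if let Some(f) = slot {
                match Pin::new(f).poll(cx) {
                    Poll::Ready(v) => {
                        results[i] = Some(v);
                        *slot = None;
                    }
                    Poll::Pending => all_done = false,
                }
            }
        }
        if all_done {
            Poll::Ready(results.iter_mut().map(|r| r.take().unwrap()).collect())
        } else {
            Poll::Pending
        }
    })
    .await
}

// Toy single-future executor: wake() unparks the polling thread.
fn block_on<F: Future>(mut fut: F) -> F::Output {
    struct ThreadWaker(std::thread::Thread);
    impl Wake for ThreadWaker {
        fn wake(self: Arc<Self>) {
            self.0.unpark();
        }
    }
    let waker = Waker::from(Arc::new(ThreadWaker(std::thread::current())));
    let mut cx = Context::from_waker(&waker);
    // Safety: `fut` is shadowed and never moves after this point.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            Poll::Pending => std::thread::park(),
        }
    }
}

fn main() {
    let out = block_on(gather(vec![ready(1), ready(2), ready(3)]));
    assert_eq!(out, vec![1, 2, 3]);
}
```

Note the one subtlety: each sweep polls *every* live child, which is O(N) per wake. That's fine for small N; futures_util avoids it with per-child wakers, which is a good follow-up read.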
Day 4 — Latency histogram
Speak to the tail
Averages lie; percentiles don't
Coordinated omission, why averages over-report good-path behaviour, and why p99 is the number trading systems care about.
The brute-force Vec<u64> + sort + index works, but it's O(n log n) and allocates. A bucketed histogram gets you O(1) record and O(buckets) quantiles. Logarithmic bucketing, where each bucket spans a fixed multiplicative range rather than a fixed absolute one, is the standard trick.
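A minimal sketch of the index math, under one concrete assumption: four sub-buckets per power of two, taken from the two bits below the most significant bit (the constant and helper name here are illustrative, not from the drill):

```rust
const BUCKETS: usize = 128;

// Map a nanosecond value to a logarithmic bucket index: 4 sub-buckets
// per power of two, clamped to the table size. 1 ns .. 1 s lands well
// inside 128 buckets (1 s ≈ 2^30 ns → index ≈ 120).
fn bucket_index(ns: u64) -> usize {
    if ns == 0 {
        return 0;
    }
    let octave = 63 - ns.leading_zeros() as usize; // floor(log2(ns))
    let sub = if octave >= 2 {
        ((ns >> (octave - 2)) & 0b11) as usize // two bits below the MSB
    } else {
        0
    };
    (octave * 4 + sub).min(BUCKETS - 1)
}

fn main() {
    assert_eq!(bucket_index(0), 0);
    assert_eq!(bucket_index(1), 0);
    // Indices grow with the value and stay in range across 1 ns .. 1 s.
    assert!(bucket_index(1_000) < bucket_index(1_000_000));
    assert!(bucket_index(1_000_000_000) < BUCKETS);
}
```

Each bucket's width is ~19% of its lower bound (2^(1/4)), which bounds the relative error of any value-to-bucket round trip; interpolating inside the bucket tightens that further.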
Log-bucketed histogram
// Fixed-size histogram with logarithmic bucketing. 1 ns .. 1 s coverage
// with 128 buckets gives roughly 10-20% relative error per bucket; Gil
// Tene's HdrHistogram trades more buckets for configurable precision.
pub struct Histogram {
    buckets: [u64; 128],
    count: u64,
    sum: u64,
}

impl Histogram {
    pub fn new() -> Self {
        Self { buckets: [0; 128], count: 0, sum: 0 }
    }

    pub fn record(&mut self, ns: u64) {
        // Map ns → bucket index. Suggested: bucket = log2(ns) * 4,
        // clamped to [0, 127]. This gives ~15% relative error per bucket,
        // tightened by interpolation in `percentile`.
        let _ = ns;
        todo!()
    }

    /// `q` in [0.0, 1.0]. Returns the *upper bound* of the bucket
    /// containing the q-th quantile. For q = 0.99 and 10k samples, that's
    /// the upper bound of the bucket containing the 9900th sample.
    pub fn percentile(&self, q: f64) -> u64 {
        let _ = q;
        todo!()
    }

    pub fn count(&self) -> u64 { self.count }

    pub fn mean(&self) -> u64 {
        if self.count == 0 { 0 } else { self.sum / self.count }
    }
}

impl Default for Histogram {
    fn default() -> Self { Self::new() }
}
Day 5 — tokio-console (local-only)
Real-time scheduler introspection
Run this outside the Playground
Install: `cargo install tokio-console`. Enable it in your app with the `console-subscriber` crate (call `console_subscriber::init()` early in `main`). Then run `tokio-console` in another terminal. Watch task counts, poll times, and resource utilization live.
Run your arb bot (or any of your async services) under tokio-console for 60 seconds. Which task has the longest p95 poll duration? What would you change?
Day 6 — Task architecture design
One task per what?
The critical design decision
Sketch the task layout for your future exchange. One task per connection? Per shard? Per subscription? Which tasks own state vs receive messages? Justify each with the reading from Days 1-2.
Capstone — spawn-overhead latency report
Use your gather + histogram to quantify spawn overhead
Target: Histogram reports p99. Spawn ≥5× slower than inline on debug. Methodology documented.
// The capstone wires gather + Histogram together to answer: what does
// tokio::spawn cost on this machine, at the tail?
//
// Required exports:
// pub struct Histogram; impl Histogram { pub fn new(), record(u64), percentile(f64) -> u64, count() -> u64 }
// async fn gather<F, T>(Vec<F>) -> Vec<T> where F: Future<Output = T> + Unpin
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
pub struct Histogram {
    buckets: [u64; 128],
    count: u64,
    sum: u64,
}

impl Histogram {
    pub fn new() -> Self { Self { buckets: [0; 128], count: 0, sum: 0 } }
    pub fn record(&mut self, ns: u64) { todo!() }
    pub fn percentile(&self, q: f64) -> u64 { todo!() }
    pub fn count(&self) -> u64 { self.count }
}

pub async fn gather<F, T>(mut futures: Vec<F>) -> Vec<T>
where
    F: Future<Output = T> + Unpin,
{
    let _ = &mut futures;
    todo!()
}
The capstone harness feeds known latencies and verifies p50/p99 accuracy; it also runs gather on small futures and checks ordering.
Running the same trivial work via tokio::spawn is measurably slower than running it inline — the ratio is recorded as BENCH.
Your Journal names the machine, Rust version, debug vs release, and samples-per-run for the numbers you report. Mark done after writing.
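The measurement loop itself is worth seeing in miniature. This sketch uses the brute-force sort-and-index percentile as a cross-check for your Histogram; `work` is a stand-in job so it runs without tokio, and in the real capstone the spawn arm wraps the same job in `tokio::spawn(...).await` while the inline arm stays as written:

```rust
use std::time::Instant;

// A trivial job; black_box keeps the optimizer from deleting it.
fn work() -> u64 {
    std::hint::black_box((0..100u64).sum())
}

// Brute-force quantile: sorted samples, ceil-indexed.
fn percentile(sorted: &[u64], q: f64) -> u64 {
    let idx = ((sorted.len() as f64 * q).ceil() as usize).max(1) - 1;
    sorted[idx.min(sorted.len() - 1)]
}

fn main() {
    let n = 10_000;
    let mut samples = Vec::with_capacity(n);
    for _ in 0..n {
        let t0 = Instant::now();
        let _ = work();
        samples.push(t0.elapsed().as_nanos() as u64);
    }
    samples.sort_unstable();
    let (p50, p99) = (percentile(&samples, 0.50), percentile(&samples, 0.99));
    assert!(p99 >= p50); // the tail is never better than the median
    println!("p50 = {} ns, p99 = {} ns", p50, p99);
}
```

Run both arms in the same process, same sample count, and report the p99 ratio, not the means; that's the BENCH number the capstone asks for.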
Feeds into
- Week 3 uses these primitives to measure graceful-shutdown timing in the ConnectionManager.
- Week 4 reuses `Histogram` to measure Axum request latencies.
- Week 12 ties everything together in the end-to-end integration benchmark.