Tokio Internals — The Runtime You Think You Know
Schedulers, wakers, and why every `await` is a structured decision.
Foundation · Core Engine & Data Structures
Why this week
You've written `tokio::spawn` a thousand times. This week you're going to learn what it actually costs. A spawn is a heap allocation, a push onto a lock-free run queue, and waker setup; it's fast, but it's not free, and in a matching engine handling 50K events/sec you feel every one of those costs as p99 latency.
You'll build two small primitives — a concurrent gather() that joins futures without futures_util, and a Histogram that lets you speak to tail latency instead of averages — and use them to measure the overhead of tokio::spawn against running the same work inline.
The Rust Playground runs on a shared multi-tenant CPU. Latency numbers here are directional, not authoritative — they'll show "spawn has an order-of-magnitude overhead over inline", but for publishable numbers you run criterion on your laptop.
Day 1 — Reading the scheduler
Tokio's worker loop
Learn the source, not the slogans
Read `run_task`, `next_task`, and the work-stealing loop. Every ready task either runs from its worker's local run queue or gets stolen by an idle worker; there is no magic.
The PR explains why the old scheduler bottlenecked on a single global queue. The fix — per-worker local queues + a stealing protocol — is the shape of every modern async runtime.
What invariant does the worker loop maintain between LIFO slot, local queue, and global queue? Why does the order matter for latency vs throughput?
When a future's waker is called, what does it actually *do*? Trace the call: Waker::wake() → ... → eventual `run_task`.
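Before tracing Tokio's version, it helps to see the contract in miniature. The sketch below is a toy, not Tokio's code: `FlagWaker` stands in for the scheduler, and `wake()` just flips a readiness flag instead of pushing the task onto a run queue. The shape of the handshake is the same: a pending future stashes the waker, `wake()` marks the task schedulable, and the executor re-polls.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Toy "scheduler": wake() flips a readiness flag the executor checks
// before re-polling. Tokio's waker instead pushes the task onto a run
// queue, but the contract is identical.
struct FlagWaker(AtomicBool);

impl Wake for FlagWaker {
    fn wake(self: Arc<Self>) {
        self.0.store(true, Ordering::SeqCst);
    }
}

// A future that is Pending once, wakes itself, then completes.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = u32;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        if self.0 {
            Poll::Ready(42)
        } else {
            self.0 = true;
            cx.waker().wake_by_ref(); // "I'm ready again, re-poll me"
            Poll::Pending
        }
    }
}

fn main() {
    let flag = Arc::new(FlagWaker(AtomicBool::new(false)));
    let waker = Waker::from(flag.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = YieldOnce(false);

    assert!(Pin::new(&mut fut).poll(&mut cx).is_pending());
    assert!(flag.0.load(Ordering::SeqCst)); // wake() ran: task is schedulable
    assert_eq!(Pin::new(&mut fut).poll(&mut cx), Poll::Ready(42));
}
```

When you read Tokio's `run_task`, look for exactly these two halves: who stores the waker, and what `wake()` enqueues.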
Day 2 — Blocking and the I/O driver
What is blocking?
Alice Ryhl's framing, plus the reactor
The definition: 'blocking' = holding the worker thread without yielding back to the scheduler. CPU-bound code blocks; synchronous file I/O blocks; even a busy loop between awaits blocks.
Walk through one async handler you've written. Where does the worker block between awaits? What would spawn_blocking vs rayon vs a yielding loop buy you?
Day 3 — Gather without external crates
Compose futures by hand
If you understand Future + Waker, you don't need futures_util
Build a gather that polls N futures concurrently and returns their results in the same order they were submitted. It's what futures::future::join_all does — rewrite it so you understand what's happening.
The simplest working implementation: spawn each future as a tokio task, hold the JoinHandles, await each in order. That does give concurrent execution — but tokio::spawn requires 'static futures. For the drill we'll do it the "real" way: poll in place on the current task, waking on any child readiness.
Implement concurrent gather()
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Poll every future to completion concurrently on the current task.
// Returns results in the same order as the inputs. Exactly like
// futures::future::join_all, without depending on futures_util.
pub async fn gather<F, T>(mut futures: Vec<F>) -> Vec<T>
where
    F: Future<Output = T> + Unpin,
{
    // Suggested approach:
    // - Wrap each future in Some(f), and use std::future::poll_fn to get
    //   access to the outer Context.
    // - On each poll: for every Some slot, poll it with
    //   Pin::new(f).poll(cx). If Ready, record the result and set the
    //   slot to None.
    // - If all slots are None, return Poll::Ready with the results;
    //   otherwise return Poll::Pending. The inner futures registered our
    //   waker, so any child's readiness will re-poll us.
    let _ = &mut futures;
    todo!()
}
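If you want to check your shape after attempting the drill, here is one sketch of the finished loop. It is not Tokio's or futures_util's code; the `block_on` below is a toy thread-park executor included only so the example runs standalone without tokio.

```rust
use std::future::{poll_fn, ready, Future};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Slots hold Some(future) until that future completes; poll_fn hands us
// the outer Context, so every child registers the same waker.
pub async fn gather<F, T>(futures: Vec<F>) -> Vec<T>
where
    F: Future<Output = T> + Unpin,
{
    let mut slots: Vec<Option<F>> = futures.into_iter().map(Some).collect();
    let mut results: Vec<Option<T>> = slots.iter().map(|_| None).collect();
    poll_fn(move |cx| {
        let mut all_done = true;
        for (i, slot) in slots.iter_mut().enumerate() {
            if let Some(f) = slot {
                match Pin::new(f).poll(cx) {
                    Poll::Ready(v) => {
                        results[i] = Some(v);
                        *slot = None;
                    }
                    Poll::Pending => all_done = false,
                }
            }
        }
        if all_done {
            Poll::Ready(results.iter_mut().map(|r| r.take().unwrap()).collect())
        } else {
            Poll::Pending
        }
    })
    .await
}

// Toy single-future executor: wake() unparks the polling thread.
fn block_on<F: Future>(mut fut: F) -> F::Output {
    struct ThreadWaker(std::thread::Thread);
    impl Wake for ThreadWaker {
        fn wake(self: Arc<Self>) {
            self.0.unpark();
        }
    }
    let waker = Waker::from(Arc::new(ThreadWaker(std::thread::current())));
    let mut cx = Context::from_waker(&waker);
    // Safety: `fut` is shadowed and never moves after this point.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            Poll::Pending => std::thread::park(),
        }
    }
}

fn main() {
    let out = block_on(gather(vec![ready(1), ready(2), ready(3)]));
    assert_eq!(out, vec![1, 2, 3]);
}
```

Note the one subtlety: each sweep polls *every* live child, which is O(N) per wake. That's fine for small N; futures_util avoids it with per-child wakers, which is a good follow-up read.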
Day 4 — Latency histogram
Speak to the tail
Averages lie; percentiles don't
Coordinated omission, why averages over-report good-path behaviour, and why p99 is the number trading systems care about.
The brute-force Vec<u64> + sort + index works, but it's O(n log n) and allocates. A bucketed histogram gets you O(1) record and O(buckets) quantiles. Logarithmic bucketing, where each bucket spans a fixed multiplicative range rather than a fixed absolute one, is the standard trick.
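A minimal sketch of the index math, under one concrete assumption: four sub-buckets per power of two, taken from the two bits below the most significant bit (the constant and helper name here are illustrative, not from the drill):

```rust
const BUCKETS: usize = 128;

// Map a nanosecond value to a logarithmic bucket index: 4 sub-buckets
// per power of two, clamped to the table size. 1 ns .. 1 s lands well
// inside 128 buckets (1 s ≈ 2^30 ns → index ≈ 120).
fn bucket_index(ns: u64) -> usize {
    if ns == 0 {
        return 0;
    }
    let octave = 63 - ns.leading_zeros() as usize; // floor(log2(ns))
    let sub = if octave >= 2 {
        ((ns >> (octave - 2)) & 0b11) as usize // two bits below the MSB
    } else {
        0
    };
    (octave * 4 + sub).min(BUCKETS - 1)
}

fn main() {
    assert_eq!(bucket_index(0), 0);
    assert_eq!(bucket_index(1), 0);
    // Indices grow with the value and stay in range across 1 ns .. 1 s.
    assert!(bucket_index(1_000) < bucket_index(1_000_000));
    assert!(bucket_index(1_000_000_000) < BUCKETS);
}
```

Each bucket's width is ~19% of its lower bound (2^(1/4)), which bounds the relative error of any value-to-bucket round trip; interpolating inside the bucket tightens that further.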
Log-bucketed histogram
// Fixed-size histogram with logarithmic bucketing. 1 ns .. 1 s coverage
// with 128 buckets gives roughly 10-20% relative error per bucket; Gil
// Tene's HdrHistogram trades more buckets for configurable precision.
pub struct Histogram {
    buckets: [u64; 128],
    count: u64,
    sum: u64,
}

impl Histogram {
    pub fn new() -> Self {
        Self { buckets: [0; 128], count: 0, sum: 0 }
    }

    pub fn record(&mut self, ns: u64) {
        // Map ns → bucket index. Suggested: bucket = log2(ns) * 4,
        // clamped to [0, 127]. This gives ~15% relative error per bucket,
        // tightened by interpolation in `percentile`.
        let _ = ns;
        todo!()
    }

    /// `q` in [0.0, 1.0]. Returns the *upper bound* of the bucket
    /// containing the q-th quantile. For q = 0.99 and 10k samples, that's
    /// the upper bound of the bucket containing the 9900th sample.
    pub fn percentile(&self, q: f64) -> u64 {
        let _ = q;
        todo!()
    }

    pub fn count(&self) -> u64 { self.count }

    pub fn mean(&self) -> u64 {
        if self.count == 0 { 0 } else { self.sum / self.count }
    }
}

impl Default for Histogram {
    fn default() -> Self { Self::new() }
}
Day 5 — tokio-console (local-only)
Real-time scheduler introspection
Run this outside the Playground
Install: `cargo install tokio-console`. Enable it in your app with the `console-subscriber` crate (call `console_subscriber::init()` early in `main`). Then run `tokio-console` in another terminal. Watch task counts, poll times, and resource utilization live.
Run your arb bot (or any of your async services) under tokio-console for 60 seconds. Which task has the longest p95 poll duration? What would you change?
Day 6 — Task architecture design
One task per what?
The critical design decision
Sketch the task layout for your future exchange. One task per connection? Per shard? Per subscription? Which tasks own state vs receive messages? Justify each with the reading from Days 1-2.
Capstone — spawn-overhead latency report
Use your gather + histogram to quantify spawn overhead
Target: Histogram reports p99. Spawn ≥5× slower than inline on debug. Methodology documented.
// The capstone wires gather + Histogram together to answer: what does
// tokio::spawn cost on this machine, at the tail?
//
// Required exports:
// pub struct Histogram; impl Histogram { pub fn new(), record(u64), percentile(f64) -> u64, count() -> u64 }
// async fn gather<F, T>(Vec<F>) -> Vec<T> where F: Future<Output = T> + Unpin
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
pub struct Histogram {
    buckets: [u64; 128],
    count: u64,
    sum: u64,
}

impl Histogram {
    pub fn new() -> Self { Self { buckets: [0; 128], count: 0, sum: 0 } }
    pub fn record(&mut self, ns: u64) { todo!() }
    pub fn percentile(&self, q: f64) -> u64 { todo!() }
    pub fn count(&self) -> u64 { self.count }
}

pub async fn gather<F, T>(mut futures: Vec<F>) -> Vec<T>
where
    F: Future<Output = T> + Unpin,
{
    let _ = &mut futures;
    todo!()
}
The capstone harness feeds known latencies and verifies p50/p99 accuracy; it also runs gather on small futures and checks ordering.
Running the same trivial work via tokio::spawn is measurably slower than running it inline — the ratio is recorded as BENCH.
Your Journal names the machine, Rust version, debug vs release, and samples-per-run for the numbers you report. Mark done after writing.
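The measurement loop itself is worth seeing in miniature. This sketch uses the brute-force sort-and-index percentile as a cross-check for your Histogram; `work` is a stand-in job so it runs without tokio, and in the real capstone the spawn arm wraps the same job in `tokio::spawn(...).await` while the inline arm stays as written:

```rust
use std::time::Instant;

// A trivial job; black_box keeps the optimizer from deleting it.
fn work() -> u64 {
    std::hint::black_box((0..100u64).sum())
}

// Brute-force quantile: sorted samples, ceil-indexed.
fn percentile(sorted: &[u64], q: f64) -> u64 {
    let idx = ((sorted.len() as f64 * q).ceil() as usize).max(1) - 1;
    sorted[idx.min(sorted.len() - 1)]
}

fn main() {
    let n = 10_000;
    let mut samples = Vec::with_capacity(n);
    for _ in 0..n {
        let t0 = Instant::now();
        let _ = work();
        samples.push(t0.elapsed().as_nanos() as u64);
    }
    samples.sort_unstable();
    let (p50, p99) = (percentile(&samples, 0.50), percentile(&samples, 0.99));
    assert!(p99 >= p50); // the tail is never better than the median
    println!("p50 = {} ns, p99 = {} ns", p50, p99);
}
```

Run both arms in the same process, same sample count, and report the p99 ratio, not the means; that's the BENCH number the capstone asks for.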
Feeds into
- Week 3 uses these primitives to measure graceful-shutdown timing in the ConnectionManager.
- Week 4 reuses `Histogram` to measure Axum request latencies.
- Week 12 ties everything together in the end-to-end integration benchmark.