The Joy-of-Quartz Unikernel Epic
Goal: bring the full M:N scheduler + real HTTP server + htop-style
live telemetry inside the unikernel, so https://mattkelly.io/
visibly demonstrates Elixir-level concurrency on bare-metal-ish
Quartz — no BEAM, no OTP boilerplate, no gen_servers.
go do -> end spawns a task; channels / select / actors just work;
the browser sees a live dashboard of CPU cycles, memory usage, task
count, requests/sec, scheduler internals. Every cycle of a ~60 KiB
ELF on a QEMU microvm guest, pushed to the limit of what the VPS
can serve.
Starting state (Apr 18 2026, post-session)
Live at https://mattkelly.io/:
- Caddy (:443 TLS via LE) → QEMU microvm hostfwd :8080 → unikernel
- Quartz-authored virtio-net / Ethernet / IPv4 / TCP (16-slot per-connection table) / HTTP/1.1 router / response builder
- Dark landing page + `/api/stats.json` polled from the browser
- 2,077 connections served pre-handoff, PMM flat at 138 pages, zero leaks, 16 concurrent connections verified
- 60 KiB ELF, 2138 compiler fixpoint functions
What’s missing for the full vision:
- M:N scheduler inside the unikernel. The kernel has a toy cooperative two-task scheduler (task A / task B). The real userspace runtime — go, channels, mutexes, select, actors, work-stealing, ~50K tasks/sec — doesn’t run in the kernel because it assumes pthreads, mmap, and kqueue/epoll, none of which exist here.
- HTTP/2 + keepalive + request pipelining. The current server is single-request, Connection: close. Fine for a demo, but not enough to squeeze real req/sec out of the VPS.
- htop-style telemetry. `/api/stats.json` shows 4 static counters. We want per-task CPU cycles (RDTSC), per-CPU utilization, rolling req/sec, a memory breakdown by category, scheduler runqueue depth, and a live task list.
Why this is hard (honest)
The userspace Quartz runtime isn’t a small dependency. codegen’s
concurrency intrinsics alone are ~1,485 LoC (cg_intrinsic_conc_task.qz),
the HTTP/2 server is 3,821 LoC (std/net/http_server.qz), and
both assume an OS below them. Porting into the unikernel requires
building the OS below them in Quartz.
Honest estimate: 6–10 quartz-weeks across 10–15 sessions. This is a real epic, not a weekend hack. Each phase below is independently shippable — intermediate milestones land working demos the user can show off between big pushes.
Phased plan
Phase K — Kernel primitives (foundation, ~1–2 sessions)
The things the scheduler and HTTP server assume exist below them.
K.1 — Slab / freelist allocator. PMM is bump-only. Tasks need dynamically-sized stacks that can be freed when the task exits. Two-tier plan:
- Freelist for 4 KiB pages (reclaim on `tcp_free_slot` + task exit).
- Slab allocator for small allocations (64 B, 128 B, 256 B, 512 B, 1024 B size classes) for task structs, channel elements, and misc kernel state.
Keep the existing bump allocator as the initial page supplier; slab carves pages into objects, freelist reclaims pages.
~300 LOC kernel. Low risk.
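The two-tier idea can be sketched in C (the real thing would be Quartz in the kernel). This is a minimal illustration, not the planned implementation: `pmm_alloc_page`, `slab_alloc`, and `slab_free` mirror the names in the plan but everything here — the static heap, the 64-page cap — is hypothetical scaffolding so the logic is testable in userspace.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative two-tier allocator: a bump "PMM" supplies 4 KiB pages;
 * the slab layer carves them into fixed-size objects, chained through
 * their first word on a per-class freelist. */
#define PAGE_SIZE 4096
#define NUM_CLASSES 5
static const size_t class_size[NUM_CLASSES] = {64, 128, 256, 512, 1024};

static uint8_t heap[64 * PAGE_SIZE];   /* stand-in for physical memory */
static size_t bump_next = 0;

static void *pmm_alloc_page(void) {
    if (bump_next + PAGE_SIZE > sizeof heap) return NULL;
    void *p = &heap[bump_next];
    bump_next += PAGE_SIZE;
    return p;
}

static void *free_head[NUM_CLASSES];   /* per-class freelist heads */

static int class_for(size_t size) {
    for (int c = 0; c < NUM_CLASSES; c++)
        if (size <= class_size[c]) return c;
    return -1;  /* too big for slab: caller falls back to whole pages */
}

void *slab_alloc(size_t size) {
    int c = class_for(size);
    if (c < 0) return pmm_alloc_page();
    if (!free_head[c]) {
        /* Refill: carve a fresh page into objects of this class. */
        uint8_t *page = pmm_alloc_page();
        if (!page) return NULL;
        for (size_t off = 0; off + class_size[c] <= PAGE_SIZE; off += class_size[c]) {
            void **obj = (void **)(page + off);
            *obj = free_head[c];
            free_head[c] = obj;
        }
    }
    void **obj = free_head[c];
    free_head[c] = *obj;
    return obj;
}

void slab_free(void *ptr, size_t size) {
    int c = class_for(size);
    if (c < 0) return;  /* page-sized: the page freelist would reclaim it */
    *(void **)ptr = free_head[c];
    free_head[c] = ptr;
}
```

The key property for the kernel is the steady state: once a page has been carved, alloc/free cycles within a class touch only the freelist and never call back into the PMM.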
K.2 — Timer subsystem. For sleep(), timeouts in select,
rate limiting.
- Priority queue of (wake_tick, task_id) ordered by wake_tick.
- `timer_sleep(ticks)`: park the current task, enqueue the wakeup, yield.
- LAPIC timer ISR checks the queue head each tick; if expired, wake-and-pop.
~200 LOC. Low risk.
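A C sketch of the (wake_tick, task_id) queue as a small binary min-heap, with the scheduler hooks left out so the data structure is testable on its own. `timer_enqueue` and `timer_pop_expired` are hypothetical names; the plan's `timer_sleep` would call the former, and the tick ISR would loop on the latter.

```c
#include <stdint.h>
#include <stddef.h>

/* Min-heap of pending timers keyed on wake_tick. Fixed capacity,
 * no allocation — suitable for a kernel hot path. */
#define MAX_TIMERS 64
typedef struct { uint64_t wake_tick; int task_id; } Timer;
static Timer heap[MAX_TIMERS];
static size_t heap_len = 0;

static void swap(size_t i, size_t j) { Timer t = heap[i]; heap[i] = heap[j]; heap[j] = t; }

void timer_enqueue(uint64_t wake_tick, int task_id) {
    size_t i = heap_len++;
    heap[i] = (Timer){wake_tick, task_id};
    while (i > 0 && heap[(i - 1) / 2].wake_tick > heap[i].wake_tick) {
        swap(i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

/* Called from the tick ISR: pop one expired timer per call and return
 * its task_id so it can be woken; -1 means nothing has expired. */
int timer_pop_expired(uint64_t now_tick) {
    if (heap_len == 0 || heap[0].wake_tick > now_tick) return -1;
    int id = heap[0].task_id;
    heap[0] = heap[--heap_len];
    size_t i = 0;
    for (;;) {  /* sift the moved element back down */
        size_t l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < heap_len && heap[l].wake_tick < heap[m].wake_tick) m = l;
        if (r < heap_len && heap[r].wake_tick < heap[m].wake_tick) m = r;
        if (m == i) break;
        swap(i, m);
        i = m;
    }
    return id;
}
```

Checking only the head each tick keeps the ISR O(1) in the common case; the O(log n) sift happens only when a timer actually fires.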
K.3 — Atomics audit. Quartz already exposes atomic
intrinsics (cmpxchg, xchg, fetch_add, fence, load/store with
orderings). Verify they compile correctly for freestanding
x86_64 and generate real lock prefixes in the IR. Also verify
volatile_load/store with the Atomic orderings behave right
when inlined into kernel code paths.
Likely zero-LOC change if it already works; possibly a small codegen tweak.
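For reference, the shapes the audit cares about map onto C11 `<stdatomic.h>` like this. This is a user-space semantic smoke test only — in the kernel the real check is that the emitted IR carries `lock` prefixes, which this can't show.

```c
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t counter = 0;

/* Exercises fetch_add, cmpxchg, xchg, and an explicit fence with the
 * orderings the kernel paths would use. Returns the pre-exchange value
 * so the sequence is checkable. */
uint64_t audit_atomics(void) {
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    uint64_t expected = 1;
    /* cmpxchg: succeeds only if counter still holds `expected`. */
    atomic_compare_exchange_strong_explicit(&counter, &expected, 42,
        memory_order_acq_rel, memory_order_acquire);
    uint64_t old = atomic_exchange_explicit(&counter, 7, memory_order_seq_cst);
    atomic_thread_fence(memory_order_seq_cst);
    return old;  /* 42 if the cmpxchg took effect */
}
```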
K.4 — IOAPIC + IRQ-driven RX (this is DEF-B). Removes the
virtio_net_rx_wait polling print-pacing hack and lets the
scheduler park in hlt between packets. Without this, a
real scheduler can’t actually sleep.
~400 LOC. HIGH brick risk — must be exhaustively tested on dev QEMU before deploy. A bad IDT/IOAPIC setup halts the CPU; remote recovery requires redeploy over SSH (which works, but the live demo goes dark during debug).
K.5 — RDTSC-based now_cycles() + per-CPU cycle accounting.
For the htop demo. rdtsc ticks at the TSC frequency (~3 GHz on modern x86), i.e. sub-nanosecond resolution.
Wrap in a now_cycles(): U64 helper, verify monotonicity under
scheduler migration (irrelevant until SMP — we’re UP for now).
~50 LOC. Trivial.
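The whole helper is a few lines; here is a C rendering (x86_64 only — `rdtsc` leaves the 64-bit counter split across EDX:EAX). The name `now_cycles` matches the plan; the inline-asm form is one possible freestanding implementation.

```c
#include <stdint.h>

/* Read the time-stamp counter. On a UP guest with an invariant TSC,
 * successive reads are monotonically non-decreasing; migration is a
 * non-issue until SMP. */
static inline uint64_t now_cycles(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```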
Phase S — Kernel M:N scheduler (~2–3 sessions)
Real goroutines running inside the unikernel.
S.1 — Task runtime.
`Task` struct (a slot in a table, similar to our `TcpConn` table):
- state: RUNNABLE / RUNNING / PARKED / DEAD
- stack_base, stack_size
- saved registers (reuse the `switch_to` asm from KERN.1)
- continuation fn pointer (for fresh tasks)
- park reason: NONE / TIMER(wake_tick) / CHANNEL(ch, dir) / IO(fd)
- CPU cycles consumed (for telemetry)
- spawn_tick (for age / “tasks-per-second” calc)
- `task_new(fn, arg): TaskHandle` — allocate slot and stack (slab), seed a fake suspended frame, enqueue.
- `task_yield()` — push self, pick next runnable, `switch_to`.
- `task_park(reason)` / `task_wake(task)` — state transitions.
- `task_exit()` — free the stack, mark DEAD, yield.
~400 LOC. Reuses existing switch_to asm + the toy scheduler
pattern. Medium risk (tiny state machine bugs = silent hangs).
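Because those tiny state-machine bugs are the risk, the transitions are worth pinning down separately from the context switch. A C sketch of the slot table and state machine, with `switch_to` and stack handling stubbed out; field and function names follow S.1 but the code itself is illustrative.

```c
#include <stdint.h>

typedef enum { T_FREE, T_RUNNABLE, T_RUNNING, T_PARKED, T_DEAD } TaskState;
typedef enum { P_NONE, P_TIMER, P_CHANNEL, P_IO } ParkReason;

typedef struct {
    TaskState  state;
    ParkReason park_reason;
    uint64_t   cycles;      /* accrued on switch_to, for telemetry */
    uint64_t   spawn_tick;  /* for age / tasks-per-second calc */
} Task;

#define MAX_TASKS 64
static Task tasks[MAX_TASKS];

/* Allocate a slot; DEAD slots are reused, like the TcpConn table. */
int task_new(uint64_t now_tick) {
    for (int i = 0; i < MAX_TASKS; i++) {
        if (tasks[i].state == T_FREE || tasks[i].state == T_DEAD) {
            tasks[i] = (Task){T_RUNNABLE, P_NONE, 0, now_tick};
            return i;
        }
    }
    return -1;  /* table full */
}

void task_park(int id, ParkReason why) {
    tasks[id].state = T_PARKED;
    tasks[id].park_reason = why;
}

void task_wake(int id) {
    if (tasks[id].state == T_PARKED) {  /* waking a non-parked task is a no-op */
        tasks[id].state = T_RUNNABLE;
        tasks[id].park_reason = P_NONE;
    }
}

void task_exit(int id) {
    tasks[id].state = T_DEAD;  /* real code: free the stack back to the slab */
}
```

Making `task_wake` a no-op on non-PARKED tasks is one defensible choice (it tolerates lost-wakeup races at the cost of hiding bugs); the real kernel may prefer to assert instead.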
S.2 — Runqueue.
Start cooperative and simple: single global runqueue (linked-list or ring buffer) of RUNNABLE task IDs. Pick-next = pop head. No work-stealing, no per-CPU queues — UP makes both irrelevant. Upgrade paths when SMP lands (KERN.8).
~100 LOC.
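The ring-buffer variant is small enough to show whole; a C sketch under the same assumptions (UP, no locking, fixed capacity):

```c
#include <stddef.h>

/* Single global runqueue of RUNNABLE task IDs. Pick-next = pop head. */
#define RQ_CAP 64
static int rq[RQ_CAP];
static size_t rq_head = 0, rq_len = 0;

int rq_push(int task_id) {
    if (rq_len == RQ_CAP) return -1;  /* full: caller's bug on a 16-slot table */
    rq[(rq_head + rq_len) % RQ_CAP] = task_id;
    rq_len++;
    return 0;
}

int rq_pop(void) {  /* -1 = runqueue empty: scheduler can hlt */
    if (rq_len == 0) return -1;
    int id = rq[rq_head];
    rq_head = (rq_head + 1) % RQ_CAP;
    rq_len--;
    return id;
}
```

An empty `rq_pop` is exactly the point where K.4 matters: with IRQ-driven RX the scheduler can `hlt` here instead of spinning.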
S.3 — Preemption. LAPIC timer ISR calls task_yield() if
the current task has consumed more than N ticks. This is the
“preemptive” in “M:N preemptive scheduler.” Care: you can’t yield from an ISR
that’s holding a lock. Design rule: ISR sets a “please yield”
flag; the next task_yield_if_flagged() check in kernel code
acts on it.
~100 LOC. Medium risk (subtle re-entrancy bugs possible).
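The flag handshake itself is tiny; a C sketch of the design rule (ISR only sets the flag, kernel code consumes it at safe points). The atomic exchange makes set-and-clear race-free; `yields` is a test counter standing in for the real `task_yield()` call.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool need_resched = false;
static int yields = 0;  /* stand-in for the actual context switch */

/* Tick ISR: never context-switches directly; just requests a yield
 * once the running task has exceeded its slice. */
void lapic_tick_isr(unsigned ticks_consumed, unsigned slice) {
    if (ticks_consumed > slice)
        atomic_store_explicit(&need_resched, true, memory_order_release);
}

/* Called at safe points in kernel code (no locks held), never from
 * the ISR itself. Exchange clears the flag atomically so a request
 * is acted on exactly once. */
void task_yield_if_flagged(void) {
    if (atomic_exchange_explicit(&need_resched, false, memory_order_acq_rel))
        yields++;  /* real code: task_yield() */
}
```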
S.4 — Channels. channel_new(capacity), channel_send(ch, val),
channel_recv(ch): val. Internals:
- Fixed-size ring buffer for capacity > 0.
- Send-wait-queue + recv-wait-queue.
- send: if the buffer has room, push; else park on send-wait.
- recv: if the buffer has an item, pop; else park on recv-wait.
- Either side’s op wakes the other side’s waiter, if present.
This is enough for goroutine-per-connection + pipeline. Full unbounded channels, select, and priority are nice-to-haves.
~250 LOC.
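A C sketch of the buffered-channel core, with parking abstracted to a `CH_WOULD_BLOCK` return so the ring-buffer logic is testable without the scheduler; the comments mark where the wait-queue wake-ups from S.4 would hook in. Names and the 16-slot cap are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

#define CH_OK           0
#define CH_WOULD_BLOCK -1

typedef struct {
    uint64_t buf[16];            /* fixed backing store; cap <= 16 */
    size_t cap, head, len;
} Channel;

void channel_init(Channel *ch, size_t capacity) {
    ch->cap = capacity <= 16 ? capacity : 16;
    ch->head = 0;
    ch->len = 0;
}

int channel_send(Channel *ch, uint64_t val) {
    if (ch->len == ch->cap) return CH_WOULD_BLOCK;  /* park on send-wait */
    ch->buf[(ch->head + ch->len) % ch->cap] = val;
    ch->len++;
    /* real code: wake one recv-waiter here, if present */
    return CH_OK;
}

int channel_recv(Channel *ch, uint64_t *out) {
    if (ch->len == 0) return CH_WOULD_BLOCK;        /* park on recv-wait */
    *out = ch->buf[ch->head];
    ch->head = (ch->head + 1) % ch->cap;
    ch->len--;
    /* real code: wake one send-waiter here, if present */
    return CH_OK;
}
```

In the kernel the `CH_WOULD_BLOCK` branches become `task_park(CHANNEL)` plus enqueue on the relevant wait-queue, and the wake comments become `task_wake` — the buffer logic stays exactly this.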
S.5 — Mutex. Futex-style: atomic cmpxchg on the state word, park on contention. Wake-one on unlock.
~80 LOC.
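The classic futex-style shape, sketched in C with the park/wake calls stubbed (they belong to the scheduler): 0 = unlocked, 1 = locked, 2 = locked with waiters, cmpxchg on the fast path.

```c
#include <stdatomic.h>

typedef struct { atomic_int state; } Mutex;  /* 0 free, 1 held, 2 held+waiters */

void mutex_lock(Mutex *m) {
    int expected = 0;
    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return;  /* fast path: uncontended, one lock cmpxchg */
    /* Slow path: mark contended; keep trying until we grab it free.
     * Exchange returning 0 means the holder released between checks. */
    while (atomic_exchange(&m->state, 2) != 0)
        ;  /* real code: task_park(current, waiting on m) */
}

void mutex_unlock(Mutex *m) {
    if (atomic_exchange(&m->state, 0) == 2)
        ;  /* real code: task_wake(one waiter) */
}
```

The 1-vs-2 distinction is what keeps unlock cheap: if the state was only ever 1, no one is parked and unlock skips the wake entirely.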
S.6 — Wire go keyword to kernel task runtime. The
compiler lowers go foo() to a scheduler spawn call. Currently
that call goes to `sched_spawn`, which assumes a pthread-backed
runtime. In kernel mode, it needs to route to `kernel_task_new`.
Options:
- Per-target shim: the compiler emits `__qz_sched_spawn_impl()`, which the target links to either the pthread or the kernel impl.
- Effects-based (clean, but blocks on Effects Phase 3).
Short-term: shim. Add a `tools/baremetal/libc_stubs.c` equivalent
for the scheduler runtime.
~150 LOC + linker discipline.
Phase H — Real HTTP server in the unikernel (~2–3 sessions)
Goroutine-per-connection model; upgrade HTTP/1.1 → HTTP/1.1 keepalive → optional h2c for HTTP/2.
H.1 — Socket abstraction. The existing kernel does TCP
inline inside tcp_handle_frame. To let a handler be a normal
function, we need a socket-like API:
- `sock_accept(): Conn` — task blocks until an ESTABLISHED connection is ready; returns a slot handle.
- `sock_read(Conn): Bytes` — task blocks until the RX buffer has new data for this conn; returns what’s available.
- `sock_write(Conn, Bytes)` — copies to the TX path.
- `sock_close(Conn)` — initiates FIN.
Under the hood: tcp_handle_frame becomes a packet demultiplexer
that wakes the task owning a slot. The handler runs in its own
task.
~400 LOC. Medium risk — the handoff between ISR and user tasks is subtle.
H.2 — Goroutine-per-connection handler. accept_loop() task
spawns a handler per sock_accept. Handler reads request,
builds response, writes, closes. This is what unlocks real
concurrency + keepalive.
~100 LOC. Easy given S.* + H.1.
H.3 — HTTP/1.1 keepalive + request pipelining. Persistent
connections are the default in HTTP/1.1. The handler loops over
sock_read/sock_write until connection idle-timeout or
peer closes. Squeezes real req/sec out of one TCP handshake.
~100 LOC.
H.4 (stretch) — h2c (HTTP/2 cleartext). Because Caddy terminates TLS, upstream can speak h2c. Needs:
- HPACK encoder + decoder. Port from userspace (no alloc-heavy paths; kernel-friendly).
- HTTP/2 framing: SETTINGS, HEADERS, DATA, PING, GOAWAY, CONTINUATION, WINDOW_UPDATE.
- Per-stream state machine.
- Flow control (connection-level + stream-level).
~2,000 LOC if ported carefully. This is the biggest single phase and the most bug-prone. Consider deferring to a later push if H.3 gives enough throughput for the demo story.
Phase D — Joy-of-Quartz htop demo (~1–2 sessions)
The landing page becomes a live dashboard. This is the marketing payoff.
D.1 — Metrics collection in the kernel.
- Per-task cycle accounting: on `switch_to`, sample RDTSC, accrue to the outgoing task, and reset the reference for the incoming one.
- Per-CPU utilization %: ratio of running-task cycles to total cycles over the last N ms.
- Rolling request-rate: ring buffer of (tick, conn_count) sampled once/sec, 60 samples. Served as JSON array.
- Memory breakdown: PMM used vs slab used vs task stacks vs scheduler overhead. Four buckets, live.
- Scheduler internals: runqueue depth, tasks spawned total, tasks spawned/sec, tasks currently RUNNABLE / PARKED / DEAD, channel ops/sec.
~300 LOC. Low risk.
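The rolling request-rate item above reduces to a tiny ring of cumulative samples; a C sketch (hypothetical names, sampled once per second as D.1 describes):

```c
#include <stdint.h>
#include <stddef.h>

/* 60-slot ring of cumulative connection counts, one sample per second.
 * req/sec over the last second is the delta of the two newest samples;
 * the whole ring is the 60 s history served as a JSON array. */
#define WINDOW 60
static uint64_t samples[WINDOW];
static size_t newest = 0, count = 0;

void stats_sample(uint64_t conn_count_total) {
    newest = (newest + 1) % WINDOW;
    samples[newest] = conn_count_total;
    if (count < WINDOW) count++;
}

uint64_t req_per_sec_1s(void) {
    if (count < 2) return 0;  /* not enough history yet */
    size_t prev = (newest + WINDOW - 1) % WINDOW;
    return samples[newest] - samples[prev];
}
```

Storing cumulative counts rather than per-second deltas means a missed sample degrades gracefully (the next delta just covers two seconds) instead of losing data.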
D.2 — Richer /api/stats.json schema.
{
  "version": 2,
  "uptime_ticks": ...,
  "cpu": {
    "cycles_total": ...,
    "cycles_user_task": ...,
    "utilization_percent": 34.2,
    "utilization_1s": [34, 35, 36, ...60 samples]
  },
  "memory": {
    "pmm_bump_pages": 138,
    "slab_pages": 12,
    "task_stacks_pages": 42,
    "total_pages": 16384
  },
  "tasks": {
    "spawned_total": 1987654,
    "spawned_per_sec": 12034,
    "runnable": 8,
    "parked": 14,
    "dead_reaped": 0,
    "active_list": [
      {"id": 1, "name": "accept_loop", "cycles": 12345678, "state": "PARKED"},
      {"id": 7, "name": "http_handler", "cycles": 456, "state": "RUNNING"},
      ...
    ]
  },
  "net": {
    "connections_served": ...,
    "connections_active": ...,
    "req_per_sec_1s": ...,
    "req_per_sec_60s_history": [...]
  },
  "version_info": {
    "quartz_fixpoint_functions": 2138,
    "kernel_sha": "...",
    "uptime_human": "2m 14s"
  }
}
The kernel response is larger than the current scratch buffer. Bump
`g_http_resp_scratch` to 4 pages (16 KiB), or stream the response
via multi-segment TCP (we already have that).
D.3 — Browser dashboard.
Inline JS + CSS, no external assets (everything served from the unikernel). Layout:
- Hero strip: language / unikernel / URL identity.
- Top row: 4 big number gauges (CPU%, tasks/sec, req/sec, mem%).
- Middle: CPU utilization sparkline (last 60 s), req/sec sparkline (last 60 s), memory breakdown bar chart.
- Bottom: scrolling task list (like htop’s row per task) with ID, name, cycles, state, age. Update every 500 ms.
- Footer: compiler fixpoint count, ELF size, source link, kernel commit sha.
~1,000 bytes of dense inline JS + ~500 bytes CSS. Fits in the 4-page scratch we’ll have by then. No external libraries; everything done with vanilla DOM.
~200 LOC kernel (HTML/JS assembly via buf_write_str), 0
external deps.
D.4 — “Elixir in one function call” marketing panel.
A live demo embedded in the page. Single button: “Spawn 10,000
tasks.” On click, POST /api/spawn-demo tells the kernel to
go do -> busy-loop-a-bit-and-exit end × 10K. Page watches
the tasks_spawned_total counter jump by 10K, the tasks_per_sec
gauge spike, the runnable count balloon then settle. Copy:
In Erlang/Elixir, this is a module with `gen_server:start_link/4`, a supervision tree, `init/1`, and `handle_call/3` — minimum. In Quartz, it is one keyword: `go`. And the kernel you’re talking to right now is running them.
Small text. Let the visible counters do the work.
~100 LOC.
Marketing angles collected (use on the landing page during D.4)
- “Elixir-style concurrency, no BEAM.” OTP-grade ergonomics, bare-metal footprint.
- “`go fn()` vs. the OTP boilerplate checklist.” Literal side-by-side. A chef’s-kiss nod to Elixir rather than a take-down.
- “10,000 tasks, one keyword, 60 KB kernel.” Fit everything on one screen with one button.
- “No libc. No Linux. No C in the network stack. No GitHub. Just Quartz.” The last one is user-directive per this session; reinforces the sourcehut stance.
- “Every cycle of this page — from TLS termination through HPACK encoding through task scheduling through response builder — was written in a language that wrote itself.” Recursive flex; honest.
Open questions for user (decisions that shape the sequencing)
- Branch strategy. Develop the epic on a worktree branch (`worktree-joy-unikernel-epic`) and deploy only at the end of each phase, keeping mattkelly.io on the current stable build? Or work on trunk and push through any breakage?
- H.4 HTTP/2 h2c: priority or stretch? H.3 (HTTP/1.1 keepalive + pipelining) gets 80% of the req/sec story with 20% of the code. H.4 is a full additional session’s work and the biggest bug source. Consider shipping without h2c initially.
- Scheduler SMP: now or later? UP only = simpler, no work-stealing, no per-CPU queues, no cache-coherency discipline. SMP is KERN.8 in the roadmap — far stretch. The Joy demo still looks glorious at one CPU.
- PSQ-10 fix (DEF-C) before or during the epic? `and`/`or` codegen mallocs per evaluation. Hot scheduler paths will hit this and allocate every tick. Worth fixing upstream first — compiler guard-gated, ~3–5 quartz-hours.
- Preemption aggressiveness. How much of a task’s time slice before a forced yield? 1 ms? 10 ms (one tick at 100 Hz)? Tunable; affects demo feel.
- TLS in the unikernel (KERN.5)? The demo says “Caddy terminates TLS” honestly — not as impressive as “the unikernel does TLS too.” KERN.5 is a ~3,000 LoC stretch (port a rustls equivalent or hand-write TLS 1.3). Far future.
Risks / sequencing advice
- K.4 (IOAPIC) first brick risk. Do it on dev QEMU only
for several sessions. Do not deploy until rock-solid.
- Keep the old `virtio_net_rx_wait` polling path behind a compile-time flag so rollback is a one-line change.
- Scheduler bugs are silent. A miswired register save in `switch_to` won’t print anything; the CPU just goes sideways. Use heavy `uart_put_str` debug tracing during development, plus fuzz-style testing with random spawn/yield/park sequences.
- HPACK is a bug magnet. If H.4 happens, plan for extra buffer. Port the existing userspace HPACK rather than rewriting it — the tests come with it.
- PSQ-10 leak. Audit every kernel hot path for `and`/`or` usage. Currently we use nested `if` in `tcp_find_slot` for exactly this reason. Extend the discipline, or fix DEF-C.
- ELF size growth. Each phase adds code. Current: 60 KB. Target after Phase D: probably ~150 KB. Still tiny. Not a problem unless we lose track.
- The live demo stays up during the epic. Keep `tmp/baremetal/quartz-unikernel-stable.elf` as a snapshot of the current HEAD before each deploy. If a new build wedges the VPS, `scp stable.elf` → `systemctl restart` recovers in <10 s.
Suggested session order
session 1: K.1 slab allocator
session 2: K.2 timer + K.3 atomics audit + K.5 RDTSC
session 3: K.4 IOAPIC IRQ-driven RX (dev-only; no deploy)
session 4: K.4 stabilize + DEF-C PSQ-10 compiler fix
session 5: S.1 task runtime + S.2 runqueue
session 6: S.3 preemption + S.4 channels
session 7: S.5 mutex + S.6 `go` keyword wiring
session 8: H.1 socket abstraction + H.2 goroutine-per-conn
session 9: H.3 keepalive + pipelining
session 10: D.1 metrics + D.2 schema
session 11: D.3 browser dashboard + D.4 Elixir marketing panel
session 12+: H.4 h2c, KERN.5 TLS, KERN.8 SMP — real stretch
Each session lands a shippable unit. User can pause, take a week off, come back without losing state.
First step for the next session
Open tools/baremetal/hello_x86.qz at pmm_alloc_page (line
~1500). Add a slab allocator that sits on top of the bump
allocator. Size classes: 64, 128, 256, 512, 1024 bytes. Free
list per class. API: slab_alloc(size): Int, slab_free(ptr, size): Void.
Stress test: 100K random alloc/free cycles don’t exhaust PMM.
Then kalloc / kfree wrappers that pick slab vs. page based
on size.
Commit cleanly. Run baremetal:qemu_http regression. Don’t
deploy yet — K.1 alone doesn’t change external behavior.
The live demo at https://mattkelly.io/ must stay up, unchanged,
until the end of Phase K. All experimentation happens in dev
QEMU.
Good luck. Go make something that makes Elixir people jealous.