The Joy-of-Quartz Unikernel Epic
Goal: bring the full M:N scheduler + real HTTP server + htop-style
live telemetry inside the unikernel, so https://mattkelly.io/
visibly demonstrates Elixir-level concurrency on bare-metal-ish
Quartz — no BEAM, no OTP boilerplate, no gen_servers.
go do -> end spawns a task; channels / select / actors just work;
the browser sees a live dashboard of CPU cycles, memory usage, task
count, requests/sec, scheduler internals. Every cycle of a ~60 KiB
ELF on a QEMU microvm guest, pushed to the limit of what the VPS
can serve.
Starting state (Apr 18 2026, post-session)
Live at https://mattkelly.io/:
- Caddy (:443 TLS via LE) → QEMU microvm hostfwd :8080 → unikernel
- Quartz-authored virtio-net / Ethernet / IPv4 / TCP (16-slot per-connection table) / HTTP/1.1 router / response builder
- Dark landing page + `/api/stats.json` polled from the browser
- 2,077 connections served pre-handoff, PMM flat at 138 pages, zero leaks, 16 concurrent connections verified
- 60 KiB ELF, 2138 compiler fixpoint functions
What’s missing for the full vision:
- M:N scheduler inside the unikernel. The kernel has a toy cooperative two-task scheduler (task A / task B). The real userspace runtime — go, channels, mutexes, select, actors, work-stealing, ~50K tasks/sec — doesn’t run in the kernel because it assumes pthreads, mmap, and kqueue/epoll, none of which exist here.
- HTTP/2 + keepalive + request pipelining. The current server is single-request, Connection: close. Fine for a demo, but not enough to squeeze real req/sec out of the VPS.
- htop-style telemetry. `/api/stats.json` shows 4 static counters. We want per-task CPU cycles (RDTSC), per-CPU utilization, rolling req/sec, a memory breakdown by category, scheduler runqueue depth, and a live task list.
Why this is hard (honest)
The userspace Quartz runtime isn’t a small dependency. codegen’s
concurrency intrinsics alone are ~1,485 LoC (cg_intrinsic_conc_task.qz),
the HTTP/2 server is 3,821 LoC (std/net/http_server.qz), and
both assume an OS below them. Porting into the unikernel requires
building the OS below them in Quartz.
Honest estimate: 6–10 quartz-weeks across 10–15 sessions. This is a real epic, not a weekend hack. Each phase below is independently shippable — intermediate milestones land working demos the user can show off between big pushes.
Phased plan
Phase K — Kernel primitives (foundation, ~1–2 sessions)
The things the scheduler and HTTP server assume exist below them.
K.1 — Slab / freelist allocator. PMM is bump-only. Tasks need dynamically-sized stacks that can be freed when the task exits. Two-tier plan:
- Freelist for 4 KiB pages (reclaim on `tcp_free_slot` + task exit).
- Slab allocator for small allocations (64 B, 128 B, 256 B, 512 B, 1024 B size classes) for task structs, channel elements, and misc kernel state.
Keep the existing bump allocator as the initial page supplier; slab carves pages into objects, freelist reclaims pages.
~300 LOC kernel. Low risk.
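The two-tier idea can be sketched in C (the real thing would be Quartz in the kernel). This is a minimal illustration, not the planned implementation: `pmm_alloc_page`, `slab_alloc`, and `slab_free` mirror the names in the plan but everything here — the static heap, the 64-page cap — is hypothetical scaffolding so the logic is testable in userspace.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative two-tier allocator: a bump "PMM" supplies 4 KiB pages;
 * the slab layer carves them into fixed-size objects, chained through
 * their first word on a per-class freelist. */
#define PAGE_SIZE 4096
#define NUM_CLASSES 5
static const size_t class_size[NUM_CLASSES] = {64, 128, 256, 512, 1024};

static uint8_t heap[64 * PAGE_SIZE];   /* stand-in for physical memory */
static size_t bump_next = 0;

static void *pmm_alloc_page(void) {
    if (bump_next + PAGE_SIZE > sizeof heap) return NULL;
    void *p = &heap[bump_next];
    bump_next += PAGE_SIZE;
    return p;
}

static void *free_head[NUM_CLASSES];   /* per-class freelist heads */

static int class_for(size_t size) {
    for (int c = 0; c < NUM_CLASSES; c++)
        if (size <= class_size[c]) return c;
    return -1;  /* too big for slab: caller falls back to whole pages */
}

void *slab_alloc(size_t size) {
    int c = class_for(size);
    if (c < 0) return pmm_alloc_page();
    if (!free_head[c]) {
        /* Refill: carve a fresh page into objects of this class. */
        uint8_t *page = pmm_alloc_page();
        if (!page) return NULL;
        for (size_t off = 0; off + class_size[c] <= PAGE_SIZE; off += class_size[c]) {
            void **obj = (void **)(page + off);
            *obj = free_head[c];
            free_head[c] = obj;
        }
    }
    void **obj = free_head[c];
    free_head[c] = *obj;
    return obj;
}

void slab_free(void *ptr, size_t size) {
    int c = class_for(size);
    if (c < 0) return;  /* page-sized: the page freelist would reclaim it */
    *(void **)ptr = free_head[c];
    free_head[c] = ptr;
}
```

The key property for the kernel is the steady state: once a page has been carved, alloc/free cycles within a class touch only the freelist and never call back into the PMM.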
K.2 — Timer subsystem. For sleep(), timeouts in select,
rate limiting.
- Priority queue of (wake_tick, task_id) ordered by wake_tick.
- `timer_sleep(ticks)`: park the current task, enqueue the wakeup, yield.
- LAPIC timer ISR checks the queue head each tick; if expired, wake-and-pop.
~200 LOC. Low risk.
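A C sketch of the (wake_tick, task_id) queue as a small binary min-heap, with the scheduler hooks left out so the data structure is testable on its own. `timer_enqueue` and `timer_pop_expired` are hypothetical names; the plan's `timer_sleep` would call the former, and the tick ISR would loop on the latter.

```c
#include <stdint.h>
#include <stddef.h>

/* Min-heap of pending timers keyed on wake_tick. Fixed capacity,
 * no allocation — suitable for a kernel hot path. */
#define MAX_TIMERS 64
typedef struct { uint64_t wake_tick; int task_id; } Timer;
static Timer heap[MAX_TIMERS];
static size_t heap_len = 0;

static void swap(size_t i, size_t j) { Timer t = heap[i]; heap[i] = heap[j]; heap[j] = t; }

void timer_enqueue(uint64_t wake_tick, int task_id) {
    size_t i = heap_len++;
    heap[i] = (Timer){wake_tick, task_id};
    while (i > 0 && heap[(i - 1) / 2].wake_tick > heap[i].wake_tick) {
        swap(i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

/* Called from the tick ISR: pop one expired timer per call and return
 * its task_id so it can be woken; -1 means nothing has expired. */
int timer_pop_expired(uint64_t now_tick) {
    if (heap_len == 0 || heap[0].wake_tick > now_tick) return -1;
    int id = heap[0].task_id;
    heap[0] = heap[--heap_len];
    size_t i = 0;
    for (;;) {  /* sift the moved element back down */
        size_t l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < heap_len && heap[l].wake_tick < heap[m].wake_tick) m = l;
        if (r < heap_len && heap[r].wake_tick < heap[m].wake_tick) m = r;
        if (m == i) break;
        swap(i, m);
        i = m;
    }
    return id;
}
```

Checking only the head each tick keeps the ISR O(1) in the common case; the O(log n) sift happens only when a timer actually fires.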
K.3 — Atomics audit. Quartz already exposes atomic
intrinsics (cmpxchg, xchg, fetch_add, fence, load/store with
orderings). Verify they compile correctly for freestanding
x86_64 and generate real lock prefixes in the IR. Also verify
volatile_load/store with the Atomic orderings behave right
when inlined into kernel code paths.
Likely zero-LOC change if it already works; possibly a small codegen tweak.
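For reference, the shapes the audit cares about map onto C11 `<stdatomic.h>` like this. This is a user-space semantic smoke test only — in the kernel the real check is that the emitted IR carries `lock` prefixes, which this can't show.

```c
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t counter = 0;

/* Exercises fetch_add, cmpxchg, xchg, and an explicit fence with the
 * orderings the kernel paths would use. Returns the pre-exchange value
 * so the sequence is checkable. */
uint64_t audit_atomics(void) {
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    uint64_t expected = 1;
    /* cmpxchg: succeeds only if counter still holds `expected`. */
    atomic_compare_exchange_strong_explicit(&counter, &expected, 42,
        memory_order_acq_rel, memory_order_acquire);
    uint64_t old = atomic_exchange_explicit(&counter, 7, memory_order_seq_cst);
    atomic_thread_fence(memory_order_seq_cst);
    return old;  /* 42 if the cmpxchg took effect */
}
```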
K.4 — IOAPIC + IRQ-driven RX (this is DEF-B). Removes the
virtio_net_rx_wait polling print-pacing hack and lets the
scheduler park in hlt between packets. Without this, a
real scheduler can’t actually sleep.
~400 LOC. HIGH brick risk — must be exhaustively tested on dev QEMU before deploy. A bad IDT/IOAPIC setup halts the CPU; remote recovery requires redeploy over SSH (which works, but the live demo goes dark during debug).
K.5 — RDTSC-based now_cycles() + per-CPU cycle accounting.
For the htop demo. rdtsc ticks at the TSC frequency (~3 GHz on modern x86), i.e. sub-nanosecond resolution.
Wrap in a now_cycles(): U64 helper, verify monotonicity under
scheduler migration (irrelevant until SMP — we’re UP for now).
~50 LOC. Trivial.
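The whole helper is a few lines; here is a C rendering (x86_64 only — `rdtsc` leaves the 64-bit counter split across EDX:EAX). The name `now_cycles` matches the plan; the inline-asm form is one possible freestanding implementation.

```c
#include <stdint.h>

/* Read the time-stamp counter. On a UP guest with an invariant TSC,
 * successive reads are monotonically non-decreasing; migration is a
 * non-issue until SMP. */
static inline uint64_t now_cycles(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```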
Phase S — Kernel M:N scheduler (~2–3 sessions)
Real goroutines running inside the unikernel.
S.1 — Task runtime.
`Task` struct (a slot in a table, similar to our `TcpConn` table):
- state: RUNNABLE / RUNNING / PARKED / DEAD
- stack_base, stack_size
- saved registers (reuse the `switch_to` asm from KERN.1)
- continuation fn pointer (for fresh tasks)
- park reason: NONE / TIMER(wake_tick) / CHANNEL(ch, dir) / IO(fd)
- CPU cycles consumed (for telemetry)
- spawn_tick (for age / “tasks-per-second” calc)
- `task_new(fn, arg): TaskHandle` — allocate slot and stack (slab), seed a fake suspended frame, enqueue.
- `task_yield()` — push self, pick next runnable, `switch_to`.
- `task_park(reason)` / `task_wake(task)` — state transitions.
- `task_exit()` — free the stack, mark DEAD, yield.
~400 LOC. Reuses existing switch_to asm + the toy scheduler
pattern. Medium risk (tiny state machine bugs = silent hangs).
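Because those tiny state-machine bugs are the risk, the transitions are worth pinning down separately from the context switch. A C sketch of the slot table and state machine, with `switch_to` and stack handling stubbed out; field and function names follow S.1 but the code itself is illustrative.

```c
#include <stdint.h>

typedef enum { T_FREE, T_RUNNABLE, T_RUNNING, T_PARKED, T_DEAD } TaskState;
typedef enum { P_NONE, P_TIMER, P_CHANNEL, P_IO } ParkReason;

typedef struct {
    TaskState  state;
    ParkReason park_reason;
    uint64_t   cycles;      /* accrued on switch_to, for telemetry */
    uint64_t   spawn_tick;  /* for age / tasks-per-second calc */
} Task;

#define MAX_TASKS 64
static Task tasks[MAX_TASKS];

/* Allocate a slot; DEAD slots are reused, like the TcpConn table. */
int task_new(uint64_t now_tick) {
    for (int i = 0; i < MAX_TASKS; i++) {
        if (tasks[i].state == T_FREE || tasks[i].state == T_DEAD) {
            tasks[i] = (Task){T_RUNNABLE, P_NONE, 0, now_tick};
            return i;
        }
    }
    return -1;  /* table full */
}

void task_park(int id, ParkReason why) {
    tasks[id].state = T_PARKED;
    tasks[id].park_reason = why;
}

void task_wake(int id) {
    if (tasks[id].state == T_PARKED) {  /* waking a non-parked task is a no-op */
        tasks[id].state = T_RUNNABLE;
        tasks[id].park_reason = P_NONE;
    }
}

void task_exit(int id) {
    tasks[id].state = T_DEAD;  /* real code: free the stack back to the slab */
}
```

Making `task_wake` a no-op on non-PARKED tasks is one defensible choice (it tolerates lost-wakeup races at the cost of hiding bugs); the real kernel may prefer to assert instead.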
S.2 — Runqueue.
Start cooperative and simple: single global runqueue (linked-list or ring buffer) of RUNNABLE task IDs. Pick-next = pop head. No work-stealing, no per-CPU queues — UP makes both irrelevant. Upgrade paths when SMP lands (KERN.8).
~100 LOC.
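The ring-buffer variant is small enough to show whole; a C sketch under the same assumptions (UP, no locking, fixed capacity):

```c
#include <stddef.h>

/* Single global runqueue of RUNNABLE task IDs. Pick-next = pop head. */
#define RQ_CAP 64
static int rq[RQ_CAP];
static size_t rq_head = 0, rq_len = 0;

int rq_push(int task_id) {
    if (rq_len == RQ_CAP) return -1;  /* full: caller's bug on a 16-slot table */
    rq[(rq_head + rq_len) % RQ_CAP] = task_id;
    rq_len++;
    return 0;
}

int rq_pop(void) {  /* -1 = runqueue empty: scheduler can hlt */
    if (rq_len == 0) return -1;
    int id = rq[rq_head];
    rq_head = (rq_head + 1) % RQ_CAP;
    rq_len--;
    return id;
}
```

An empty `rq_pop` is exactly the point where K.4 matters: with IRQ-driven RX the scheduler can `hlt` here instead of spinning.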
S.3 — Preemption. LAPIC timer ISR calls task_yield() if
the current task has consumed more than N ticks. This is the
“preemptive” in “M:N preemptive scheduler.” Care: you can’t yield from an ISR
that’s holding a lock. Design rule: ISR sets a “please yield”
flag; the next task_yield_if_flagged() check in kernel code
acts on it.
~100 LOC. Medium risk (subtle re-entrancy bugs possible).
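The flag handshake itself is tiny; a C sketch of the design rule (ISR only sets the flag, kernel code consumes it at safe points). The atomic exchange makes set-and-clear race-free; `yields` is a test counter standing in for the real `task_yield()` call.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool need_resched = false;
static int yields = 0;  /* stand-in for the actual context switch */

/* Tick ISR: never context-switches directly; just requests a yield
 * once the running task has exceeded its slice. */
void lapic_tick_isr(unsigned ticks_consumed, unsigned slice) {
    if (ticks_consumed > slice)
        atomic_store_explicit(&need_resched, true, memory_order_release);
}

/* Called at safe points in kernel code (no locks held), never from
 * the ISR itself. Exchange clears the flag atomically so a request
 * is acted on exactly once. */
void task_yield_if_flagged(void) {
    if (atomic_exchange_explicit(&need_resched, false, memory_order_acq_rel))
        yields++;  /* real code: task_yield() */
}
```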
S.4 — Channels. channel_new(capacity), channel_send(ch, val),
channel_recv(ch): val. Internals:
- Fixed-size ring buffer for capacity > 0.
- Send-wait-queue + recv-wait-queue.
- send: if the buffer has room, push; else park on send-wait.
- recv: if the buffer has an item, pop; else park on recv-wait.
- Either side’s op wakes the other side’s waiter, if present.
This is enough for goroutine-per-connection + pipeline. Full unbounded channels, select, and priority are nice-to-haves.
~250 LOC.
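A C sketch of the buffered-channel core, with parking abstracted to a `CH_WOULD_BLOCK` return so the ring-buffer logic is testable without the scheduler; the comments mark where the wait-queue wake-ups from S.4 would hook in. Names and the 16-slot cap are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

#define CH_OK           0
#define CH_WOULD_BLOCK -1

typedef struct {
    uint64_t buf[16];            /* fixed backing store; cap <= 16 */
    size_t cap, head, len;
} Channel;

void channel_init(Channel *ch, size_t capacity) {
    ch->cap = capacity <= 16 ? capacity : 16;
    ch->head = 0;
    ch->len = 0;
}

int channel_send(Channel *ch, uint64_t val) {
    if (ch->len == ch->cap) return CH_WOULD_BLOCK;  /* park on send-wait */
    ch->buf[(ch->head + ch->len) % ch->cap] = val;
    ch->len++;
    /* real code: wake one recv-waiter here, if present */
    return CH_OK;
}

int channel_recv(Channel *ch, uint64_t *out) {
    if (ch->len == 0) return CH_WOULD_BLOCK;        /* park on recv-wait */
    *out = ch->buf[ch->head];
    ch->head = (ch->head + 1) % ch->cap;
    ch->len--;
    /* real code: wake one send-waiter here, if present */
    return CH_OK;
}
```

In the kernel the `CH_WOULD_BLOCK` branches become `task_park(CHANNEL)` plus enqueue on the relevant wait-queue, and the wake comments become `task_wake` — the buffer logic stays exactly this.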
S.5 — Mutex. Futex-style: atomic cmpxchg on the state word, park on contention. Wake-one on unlock.
~80 LOC.
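The classic futex-style shape, sketched in C with the park/wake calls stubbed (they belong to the scheduler): 0 = unlocked, 1 = locked, 2 = locked with waiters, cmpxchg on the fast path.

```c
#include <stdatomic.h>

typedef struct { atomic_int state; } Mutex;  /* 0 free, 1 held, 2 held+waiters */

void mutex_lock(Mutex *m) {
    int expected = 0;
    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return;  /* fast path: uncontended, one lock cmpxchg */
    /* Slow path: mark contended; keep trying until we grab it free.
     * Exchange returning 0 means the holder released between checks. */
    while (atomic_exchange(&m->state, 2) != 0)
        ;  /* real code: task_park(current, waiting on m) */
}

void mutex_unlock(Mutex *m) {
    if (atomic_exchange(&m->state, 0) == 2)
        ;  /* real code: task_wake(one waiter) */
}
```

The 1-vs-2 distinction is what keeps unlock cheap: if the state was only ever 1, no one is parked and unlock skips the wake entirely.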
S.6 — Wire go keyword to kernel task runtime. The
compiler lowers go foo() to a scheduler spawn call. Currently
that call goes to `sched_spawn`, which assumes a pthread-backed
runtime. In kernel mode, it needs to route to `kernel_task_new`.
Options:
- Per-target shim: the compiler emits `__qz_sched_spawn_impl()`, which the target links to either the pthread or the kernel impl.
- Effects-based (clean, but blocks on Effects Phase 3).
Short-term: shim. Add a `tools/baremetal/libc_stubs.c` equivalent
for the scheduler runtime.
~150 LOC + linker discipline.
Phase H — Real HTTP server in the unikernel (~2–3 sessions)
Goroutine-per-connection model; upgrade HTTP/1.1 → HTTP/1.1 keepalive → optional h2c for HTTP/2.
H.1 — Socket abstraction. The existing kernel does TCP
inline inside tcp_handle_frame. To let a handler be a normal
function, we need a socket-like API:
- `sock_accept(): Conn` — task blocks until an ESTABLISHED connection is ready; returns a slot handle.
- `sock_read(Conn): Bytes` — task blocks until the RX buffer has new data for this conn; returns what’s available.
- `sock_write(Conn, Bytes)` — copies to the TX path.
- `sock_close(Conn)` — initiates FIN.
Under the hood: tcp_handle_frame becomes a packet demultiplexer
that wakes the task owning a slot. The handler runs in its own
task.
~400 LOC. Medium risk — the handoff between ISR and user tasks is subtle.
H.2 — Goroutine-per-connection handler. accept_loop() task
spawns a handler per sock_accept. Handler reads request,
builds response, writes, closes. This is what unlocks real
concurrency + keepalive.
~100 LOC. Easy given S.* + H.1.
H.3 — HTTP/1.1 keepalive + request pipelining. Persistent
connections are the default in HTTP/1.1. The handler loops over
sock_read/sock_write until connection idle-timeout or
peer closes. Squeezes real req/sec out of one TCP handshake.
~100 LOC.
H.4 (stretch) — h2c (HTTP/2 cleartext). Because Caddy terminates TLS, upstream can speak h2c. Needs:
- HPACK encoder + decoder. Port from userspace (no alloc-heavy paths; kernel-friendly).
- HTTP/2 framing: SETTINGS, HEADERS, DATA, PING, GOAWAY, CONTINUATION, WINDOW_UPDATE.
- Per-stream state machine.
- Flow control (connection-level + stream-level).
~2,000 LOC if ported carefully. This is the biggest single phase and the most bug-prone. Consider deferring to a later push if H.3 gives enough throughput for the demo story.
Phase D — Joy-of-Quartz htop demo (~1–2 sessions)
The landing page becomes a live dashboard. This is the marketing payoff.
D.1 — Metrics collection in the kernel.
- Per-task cycle accounting: on `switch_to`, sample RDTSC, accrue to the outgoing task, and reset the reference for the incoming one.
- Per-CPU utilization %: ratio of running-task cycles to total cycles over the last N ms.
- Rolling request-rate: ring buffer of (tick, conn_count) sampled once/sec, 60 samples. Served as JSON array.
- Memory breakdown: PMM used vs slab used vs task stacks vs scheduler overhead. Four buckets, live.
- Scheduler internals: runqueue depth, tasks spawned total, tasks spawned/sec, tasks currently RUNNABLE / PARKED / DEAD, channel ops/sec.
~300 LOC. Low risk.
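The rolling request-rate item above reduces to a tiny ring of cumulative samples; a C sketch (hypothetical names, sampled once per second as D.1 describes):

```c
#include <stdint.h>
#include <stddef.h>

/* 60-slot ring of cumulative connection counts, one sample per second.
 * req/sec over the last second is the delta of the two newest samples;
 * the whole ring is the 60 s history served as a JSON array. */
#define WINDOW 60
static uint64_t samples[WINDOW];
static size_t newest = 0, count = 0;

void stats_sample(uint64_t conn_count_total) {
    newest = (newest + 1) % WINDOW;
    samples[newest] = conn_count_total;
    if (count < WINDOW) count++;
}

uint64_t req_per_sec_1s(void) {
    if (count < 2) return 0;  /* not enough history yet */
    size_t prev = (newest + WINDOW - 1) % WINDOW;
    return samples[newest] - samples[prev];
}
```

Storing cumulative counts rather than per-second deltas means a missed sample degrades gracefully (the next delta just covers two seconds) instead of losing data.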
D.2 — Richer /api/stats.json schema.
{
  "version": 2,
  "uptime_ticks": ...,
  "cpu": {
    "cycles_total": ...,
    "cycles_user_task": ...,
    "utilization_percent": 34.2,
    "utilization_1s": [34, 35, 36, ...60 samples]
  },
  "memory": {
    "pmm_bump_pages": 138,
    "slab_pages": 12,
    "task_stacks_pages": 42,
    "total_pages": 16384
  },
  "tasks": {
    "spawned_total": 1987654,
    "spawned_per_sec": 12034,
    "runnable": 8,
    "parked": 14,
    "dead_reaped": 0,
    "active_list": [
      {"id": 1, "name": "accept_loop", "cycles": 12345678, "state": "PARKED"},
      {"id": 7, "name": "http_handler", "cycles": 456, "state": "RUNNING"},
      ...
    ]
  },
  "net": {
    "connections_served": ...,
    "connections_active": ...,
    "req_per_sec_1s": ...,
    "req_per_sec_60s_history": [...]
  },
  "version_info": {
    "quartz_fixpoint_functions": 2138,
    "kernel_sha": "...",
    "uptime_human": "2m 14s"
  }
}
The kernel response is larger than the current scratch buffer. Bump
`g_http_resp_scratch` to 4 pages (16 KiB), or stream the response
via multi-segment TCP (we already have that).
D.3 — Browser dashboard.
Inline JS + CSS, no external assets (everything served from the unikernel). Layout:
- Hero strip: language / unikernel / URL identity.
- Top row: 4 big number gauges (CPU%, tasks/sec, req/sec, mem%).
- Middle: CPU utilization sparkline (last 60 s), req/sec sparkline (last 60 s), memory breakdown bar chart.
- Bottom: scrolling task list (like htop’s row per task) with ID, name, cycles, state, age. Update every 500 ms.
- Footer: compiler fixpoint count, ELF size, source link, kernel commit sha.
~1,000 bytes of dense inline JS + ~500 bytes CSS. Fits in the 4-page scratch we’ll have by then. No external libraries; everything done with vanilla DOM.
~200 LOC kernel (HTML/JS assembly via buf_write_str), 0
external deps.
D.4 — “Elixir in one function call” marketing panel.
A live demo embedded in the page. Single button: “Spawn 10,000
tasks.” On click, POST /api/spawn-demo tells the kernel to
go do -> busy-loop-a-bit-and-exit end × 10K. Page watches
the tasks_spawned_total counter jump by 10K, the tasks_per_sec
gauge spike, the runnable count balloon then settle. Copy:
In Erlang/Elixir, this is a module with `gen_server:start_link/4`, a supervision tree, `init/1`, and `handle_call/3` — minimum. In Quartz, it is one keyword: `go`. And the kernel you’re talking to right now is running them.
Small text. Let the visible counters do the work.
~100 LOC.
Marketing angles collected (use on the landing page during D.4)
- “Elixir-style concurrency, no BEAM.” OTP-grade ergonomics, bare-metal footprint.
- “`go fn()` vs. the OTP boilerplate checklist.” Literal side-by-side. A chef’s-kiss nod to Elixir rather than a take-down.
- “10,000 tasks, one keyword, 60 KB kernel.” Fit everything on one screen with one button.
- “No libc. No Linux. No C in the network stack. No GitHub. Just Quartz.” The last one is user-directive per this session; reinforces the sourcehut stance.
- “Every cycle of this page — from TLS termination through HPACK encoding through task scheduling through response builder — was written in a language that wrote itself.” Recursive flex; honest.
Open questions for user (decisions that shape the sequencing)
- Branch strategy. Develop the epic on a worktree branch (`worktree-joy-unikernel-epic`) and deploy only at the end of each phase, keeping mattkelly.io on the current stable build? Or work on trunk and push through any breakage?
- H.4 HTTP/2 h2c: priority or stretch? H.3 (HTTP/1.1 keepalive + pipelining) gets 80% of the req/sec story with 20% of the code. H.4 is a full additional session’s work and the biggest bug source. Consider shipping without h2c initially.
- Scheduler SMP: now or later? UP only = simpler, no work-stealing, no per-CPU queues, no cache-coherency discipline. SMP is KERN.8 in the roadmap — far stretch. The Joy demo still looks glorious at one CPU.
- PSQ-10 fix (DEF-C) before or during the epic? `and`/`or` codegen mallocs per evaluation. Hot scheduler paths will hit this and allocate every tick. Worth fixing upstream first — compiler guard-gated, ~3–5 quartz-hours.
- Preemption aggressiveness. How much of a task’s time slice before a forced yield? 1 ms? 10 ms (one tick at 100 Hz)? Tunable; affects demo feel.
- TLS in the unikernel (KERN.5)? The demo says “Caddy terminates TLS” honestly — not as impressive as “the unikernel does TLS too.” KERN.5 is a ~3,000 LoC stretch (port a rustls equivalent or hand-write TLS 1.3). Far future.
Risks / sequencing advice
- K.4 (IOAPIC) first brick risk. Do it on dev QEMU only
for several sessions. Do not deploy until rock-solid.
- Keep the old `virtio_net_rx_wait` polling path behind a compile-time flag so rollback is a one-line change.
- Scheduler bugs are silent. A miswired register save in `switch_to` won’t print anything; the CPU just goes sideways. Use heavy `uart_put_str` debug tracing during development, plus fuzz-style testing with random spawn/yield/park sequences.
- HPACK is a bug magnet. If H.4 happens, plan for extra buffer. Port the existing userspace HPACK rather than rewriting it — the tests come with it.
- PSQ-10 leak. Audit every kernel hot path for `and`/`or` usage. Currently we use nested `if` in `tcp_find_slot` for exactly this reason. Extend the discipline, or fix DEF-C.
- ELF size growth. Each phase adds code. Current: 60 KB. Target after Phase D: probably ~150 KB. Still tiny. Not a problem unless we lose track.
- The live demo stays up during the epic. Keep `tmp/baremetal/quartz-unikernel-stable.elf` as a snapshot of the current HEAD before each deploy. If a new build wedges the VPS, `scp stable.elf` → `systemctl restart` recovers in <10 s.
Suggested session order
session 1: K.1 slab allocator
session 2: K.2 timer + K.3 atomics audit + K.5 RDTSC
session 3: K.4 IOAPIC IRQ-driven RX (dev-only; no deploy)
session 4: K.4 stabilize + DEF-C PSQ-10 compiler fix
session 5: S.1 task runtime + S.2 runqueue
session 6: S.3 preemption + S.4 channels
session 7: S.5 mutex + S.6 `go` keyword wiring
session 8: H.1 socket abstraction + H.2 goroutine-per-conn
session 9: H.3 keepalive + pipelining
session 10: D.1 metrics + D.2 schema
session 11: D.3 browser dashboard + D.4 Elixir marketing panel
session 12+: H.4 h2c, KERN.5 TLS, KERN.8 SMP — real stretch
Each session lands a shippable unit. User can pause, take a week off, come back without losing state.
First step for the next session
Open tools/baremetal/hello_x86.qz at pmm_alloc_page (line
~1500). Add a slab allocator that sits on top of the bump
allocator. Size classes: 64, 128, 256, 512, 1024 bytes. Free
list per class. API: slab_alloc(size): Int, slab_free(ptr, size): Void.
Stress test: 100K random alloc/free cycles don’t exhaust PMM.
Then kalloc / kfree wrappers that pick slab vs. page based
on size.
Commit cleanly. Run baremetal:qemu_http regression. Don’t
deploy yet — K.1 alone doesn’t change external behavior.
The live demo at https://mattkelly.io/ must stay up, unchanged,
until the end of Phase K. All experimentation happens in dev
QEMU.
Good luck. Go make something that makes Elixir people jealous.