Quartz v5.25

Handoff — Quartz unikernel serving the full Astro site (Apr 19 2026)

Session summary. Took the unikernel from “3 hardcoded routes + 16-slot TCP table, crashed this morning” to “88 baked Astro pages + full ETag/304 caching + hardened virtio TX + detailed bug docs for everything unfixed.” Four commits on branch unikernel-site (worktree at .claude/worktrees/unikernel-site/). User’s piezo/effects session on trunk was not touched.

Live: http://195.35.36.247:8080/ — dynamic landing (Quartz telemetry) + 88 baked routes served byte-exact from PMM.

What landed

1. Asset-bake pipeline (641853a8)

  • tools/bake_assets.qz (new, 180 lines) — walks site/dist/ via sh_capture("find -L site/dist -type f | sort"), hex-escapes every byte, emits tools/baremetal/site_assets.qz (gitignored, regenerate with quake baremetal:bake_assets).
  • tools/baremetal/hello_x86.qz grew +200 lines: 128-slot asset table, copy_str_to_pmm, FNV-1a hash, ETag emit + match, router extensions, chunker rewrite to walk headers-scratch + baked-body as one virtual stream.
  • Quakefile.qz — new baremetal:bake_assets task; build_elf + qemu_http now concat hello_x86.qz + site_assets.qz before compile. Generated source is 14 MB; compiler chews through it in ~2 seconds.
  • One gotcha you’ll hit if you write more tool-side .qz programs: import * from quake resolves to tools/quake.qz (the launcher) rather than std/quake.qz, because the adjacent-file search beats the -I std path. bake_assets.qz inlines its own shell_capture to dodge the collision.

2. TX path hardening + 209 KB bug doc (ce0cd525)

Removed virtio_net_tx_send’s “fake-complete after 10M spins” escape hatch — it was silently corrupting the descriptor ring when the device backed up. The send path now spins indefinitely (a genuinely broken device visibly hangs the kernel rather than dropping packets) and bumps g_tx_stalls once when a send first crosses the 10M-spin milestone; the counter is exposed in /api/stats.json.
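The shape of that change can be sketched in C as a simulation — tx_wait_sim and completes_after stand in for the real virtio polling loop; only the g_tx_stalls name and the 10M milestone come from this note:

```c
#include <stdint.h>

#define STALL_MILESTONE 10000000ULL  /* 10M spins: count a stall, keep waiting */

static uint64_t g_tx_stalls = 0;     /* surfaced in /api/stats.json */

/* Simulated wait for TX completion: the device "completes" the buffer after
 * completes_after polls. There is no fake-complete exit: if the device never
 * completes, this loop never returns (a visible hang instead of silent
 * descriptor-ring corruption). Returns the spin count. */
static uint64_t tx_wait_sim(uint64_t completes_after) {
    uint64_t spins = 0;
    while (spins < completes_after) {
        spins++;
        if (spins == STALL_MILESTONE)
            g_tx_stalls++;           /* counted once per send, never exits */
    }
    return spins;
}
```

The point of the milestone counter is observability without behavior change: a backed-up device shows up in stats long before anyone notices a hang.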

The 209 KB stall we hit during Phase 2 is a separate bug (not what the fake-complete fix addressed). Root cause confirmed via max_seg experiment: peer Linux’s net.core.rmem_default = 212992 caps per-connection receive buffer at ~208 KB; since we don’t honor the advertised TCP window and don’t retransmit, segments past that are dropped and never recovered. Full writeup in docs/bugs/UNIKERNEL_TX_STALL_209KB.md. Workaround: bake filter skips 3 docs > 200 KB.

3. ETag + 304 Not Modified (2a1c26dd)

FNV-1a 64-bit hashed over each body at register time, rendered as 16 hex digits in a fresh PMM page. ETag: "<16hex>" is emitted on all 200 responses; the router scans If-None-Match: "<etag>" and returns a headers-only 304 on match. Asset entry size grew 48 → 64 bytes, table backing grew 1 → 2 PMM pages. Verified live end-to-end.
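The hash/ETag scheme is small enough to pin down in C — a sketch consistent with the description above; the function names are mine, not the kernel’s:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FNV-1a 64-bit over an asset body (standard offset basis and prime). */
static uint64_t fnv1a64(const uint8_t *p, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];                        /* xor byte, then multiply: 1a order */
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Render the hash as the 16-hex-digit ETag body (goes inside the quotes). */
static void etag_hex(uint64_t h, char out[17]) {
    snprintf(out, 17, "%016llx", (unsigned long long)h);
}
```

A matching If-None-Match check then reduces to comparing 16 bytes between the quotes, which is why the 304 path can stay headers-only.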

4. Asset stats exposed on landing (85610d2c)

assets + assets_bytes in /api/stats.json. New “Baked” card on the dynamic landing: “88 / 2292 KiB of docs + CSS + JS”, updated by the existing 500 ms JS poll.

What’s live

$ curl -sSi http://195.35.36.247:8080/marketing | head -8
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 20300
ETag: "0834acc5eab2a2d6"
Connection: close
Server: Quartz-unikernel
Cache-Control: public, max-age=60

$ curl -H 'If-None-Match: "0834acc5eab2a2d6"' ... → HTTP 304
$ curl .../api/stats.json
{"version":1,...,"tx_stalls":0,"assets":88,"assets_bytes":2347635,"mac":"52:54:00:12:34:56"}

What’s next, ordered by punch-through

A. The RX ring stall that started this session

docs/bugs/UNIKERNEL_RX_RING_STALL.md. Symptom: used.idx stops advancing after some hours of serving; the kernel log keeps printing [rx: u=16444 c=...] — frames arrive but are never consumed. Workaround: systemctl restart quartz-unikernel. Real fix is DEF-B — IOAPIC + IRQ-driven RX instead of the tight polling loop. Multi-session epic; rated HIGH brick-risk on the VPS because a bad IDT/IOAPIC setup halts the CPU.

Cheap intermediate win while waiting for DEF-B: a host-side health probe — a systemd timer (or cron job) that runs curl -m 3 against /health every 30 s and systemctl-restarts the unikernel after two consecutive failures. Keeps the service up while the root-cause work happens elsewhere.
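One possible shape for that probe, run from a 30 s systemd timer or cron entry on the host — the script path, flag file, and names are hypothetical; only the /health endpoint, 3 s timeout, 30 s cadence, and two-failure rule come from this note:

```shell
#!/bin/sh
# /usr/local/bin/quartz-health-check (hypothetical path)
# One prior failure is remembered in a flag file; the second consecutive
# failure restarts the unikernel and clears the flag.
FLAG=/run/quartz-health.fail
if curl -fsm 3 http://127.0.0.1:8080/health >/dev/null; then
    rm -f "$FLAG"                          # healthy: reset the streak
elif [ -e "$FLAG" ]; then
    rm -f "$FLAG"
    systemctl restart quartz-unikernel     # second failure in a row
else
    touch "$FLAG"                          # first failure: remember it
fi
```

Pairing this with a systemd timer (OnUnitActiveSec=30s driving a oneshot service) keeps the restart policy entirely outside the kernel under test.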

B. TCP receive window + retransmits (DEF-D subset)

Unlocks the 3 skipped docs > 200 KB (quartz_reference 228 KB, two large roadmap archives). Minimum viable fix:

  1. Parse peer’s advertised window from ACK frames (tcp_hdr + 14, big-endian u16). Store in the conn slot.
  2. Chunker: track bytes_inflight = snd_nxt - snd_una. Don’t send the next segment if bytes_inflight + chunk > peer_window.
  3. To advance past a closed window, the chunker must yield back to the RX loop so it can process ACKs. This is the hard part — the current dispatch is synchronous. Options:
     a. Convert the chunker to a per-connection state machine driven by RX events (ACK arrives → send next chunk).
     b. Add a sub-RX poll inside the chunker: after every N segments, service any pending RX frames, then resume.
  4. Retransmits: track per-segment (seq, len, timestamp). On ACK, mark acked segments. Periodically resend unacked beyond RTO.
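Steps 1–2 above are mostly arithmetic; a C sketch, where the struct layout is illustrative and the field names (snd_nxt, snd_una, peer_window) follow the plan:

```c
#include <stdint.h>
#include <stddef.h>

struct conn {
    uint32_t snd_una;      /* oldest unacked sequence number */
    uint32_t snd_nxt;      /* next sequence number to send */
    uint16_t peer_window;  /* last advertised window from the peer */
};

/* Step 1: the window field sits at byte offset 14 of the TCP header,
 * big-endian u16. */
static uint16_t tcp_window(const uint8_t *tcp_hdr) {
    return (uint16_t)((tcp_hdr[14] << 8) | tcp_hdr[15]);
}

/* Step 2: only send if the chunk fits in the peer's remaining window.
 * Unsigned subtraction handles sequence-number wraparound. */
static int can_send(const struct conn *c, uint32_t chunk_len) {
    uint32_t bytes_inflight = c->snd_nxt - c->snd_una;
    return bytes_inflight + chunk_len <= c->peer_window;
}
```

Step 3 is where the real work is: can_send returning 0 only helps if something later re-runs it when an ACK moves snd_una forward.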

Estimate: 4-8 quartz-hours. Honest kernel work.

C. HTTPS via Caddy subdomain

Blocked on DNS — user needs to add unikernel.mattkelly.io A 195.35.36.247 (or similar). Once DNS is live:

# on VPS, append to /etc/caddy/Caddyfile:
unikernel.mattkelly.io {
    reverse_proxy localhost:8080
}
# then:
systemctl reload caddy

Let’s Encrypt cert lands on first HTTPS hit, zero impact on the existing mattkelly.io (fly.io) config.

D. In-browser WASM playground

Blocked on TGT.3 (direct WASM backend, not started). The /playground page currently serves fine but the “Run” button can only ever fall back to “compile via Caddy-proxied backend service,” which we don’t run. When the backend lands, wire a POST endpoint in the unikernel that accepts source, spawns the in-browser WASM compiler, streams output.

E. Smaller polish still worth doing

  1. Styled 404 page (currently plain text) — match the dark theme of the landing. 20-min job.
  2. Add /api/recent.json — a 64-slot ring buffer of last N served paths + timestamps, so the landing can show “what others are viewing.” Showcases a live-kernel-state demo the visitor can watch update in real time.
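A sketch of that ring in C — the 64-slot count comes from the note; the entry layout, path length cap, and function names are assumptions:

```c
#include <stdint.h>
#include <string.h>

#define RECENT_SLOTS 64
#define RECENT_PATH_MAX 96   /* assumed cap; truncates longer paths */

struct recent_entry { char path[RECENT_PATH_MAX]; uint64_t ts; };

static struct recent_entry g_recent[RECENT_SLOTS];
static uint32_t g_recent_head = 0;   /* total records ever written */

/* Record one served path; overwrites the oldest slot once full. */
static void recent_record(const char *path, uint64_t ts) {
    struct recent_entry *e = &g_recent[g_recent_head % RECENT_SLOTS];
    strncpy(e->path, path, RECENT_PATH_MAX - 1);
    e->path[RECENT_PATH_MAX - 1] = '\0';
    e->ts = ts;
    g_recent_head++;
}

/* Newest-first access for the JSON emitter: i = 0 is the latest hit.
 * Returns NULL past the number of recorded entries. */
static const struct recent_entry *recent_get(uint32_t i) {
    uint32_t count = g_recent_head < RECENT_SLOTS ? g_recent_head
                                                  : RECENT_SLOTS;
    if (i >= count) return 0;
    return &g_recent[(g_recent_head - 1 - i) % RECENT_SLOTS];
}
```

Fixed-size entries keep this PMM-friendly: one page holds the whole ring, and the emitter just walks recent_get(0..N) until NULL.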
  3. Bake the oversized docs into multiple smaller chunks with server-side concat-on-request, as a stop-gap before B lands.
  4. Add /api/build_info.json — ELF size at boot (via a linker symbol), baked-at timestamp (passed in at build time via a --define flag we don’t have yet), compiler version.

Repo state

  • Branch: unikernel-site, 4 commits ahead of trunk at 1d90d51b.
  • Worktree dir: .claude/worktrees/unikernel-site/. Contains a site/dist -> /Users/mathisto/projects/quartz/site/dist symlink (local-only convenience, not tracked).
  • Merge target: once TX window work is done, merge to trunk. Until then, branch stays separate so the effects-epic work on trunk doesn’t have to carry the unikernel changes.
  • Production ELF: mattkelly.io:/opt/quartz/quartz-unikernel.elf, 2.42 MB. Regenerate with:
    quake baremetal:bake_assets   # if site/dist changed
    quake baremetal:build_elf
    scp tmp/baremetal/quartz-unikernel.elf mattkelly.io:/opt/quartz/
    ssh mattkelly.io systemctl restart quartz-unikernel

One thing I’d want the next Claude to NOT do

Don’t implement the 200 KB TX-stall workaround as “pace the chunker with a delay between segments.” It might work by giving SLIRP time to drain but it’s the wrong model — we’d be papering over a missing protocol feature (flow control) with timing, which breaks on any link with different latency characteristics. Plan B (real TCP window) is the only honest fix. Skipping the 3 large docs is better than a fragile timing hack.