Session 4 Handoff — Iomap Fix + Lessons Learned

Date: Apr 11, 2026 (overnight session)
Starting state: trunk at 96ad7a95 (session 3 handoff)
Ending state: trunk at 38c88277 (clean, 2 commits ahead)

What shipped

Two commits on trunk, neither pushed:

38c88277 Regression tests for iomap fixes + refresh Linux golden binary
b86bde04 Guard iomap access in channel codegen: sync-context crash + sched_shutdown UAF

`b86bde04` — iomap null/dangling guard

Three related bugs in channel codegen that all involved @__qz_sched[12] (the iomap base pointer) being either zero or dangling:

try_send in sync code crashed. select { send(ch, 42) => 0 end } from a pure-synchronous program segfaulted on the first successful send: the inlined try_send intrinsic unconditionally read the iomap base and indexed it to wake any io-suspended peer. In sync code the scheduler never initializes, so slot 12 is zero and the xchg deref’d null+fd*8.
channel_close had the same pattern (different label prefix) and crashed the same way from sync code.
sched_shutdown freed the iomap via call void @free(...) but never zeroed slot 12. Subsequent channel_close chased the freed pointer.

Fixes:

cg_intrinsic_conc_channel_try.qz: insert an .iomap.has null check before the xchg in try_send’s io_wake block. Branches to ts_io_skip_ if iomap is null.
cg_intrinsic_conc_channel.qz: same guard for channel_close (cc_ prefix).
codegen_runtime.qz: zero slot 12 after free(%io.map.raw) in __qz_sched_shutdown.
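
The guard and the zero-after-free can be seen together in a small model. This is a Python sketch, not the emitted IR: `SCHED`, `HEAP`, the array size, and the function bodies are illustrative stand-ins. Only the slot index 12 and the "zero means no iomap" convention come from the notes above.

```python
# Toy model of the slot-12 lifecycle; not the real codegen.
IOMAP_SLOT = 12            # slot index taken from the notes above
SCHED = [0] * 16           # stand-in for @__qz_sched (the size is an assumption)
HEAP = {}                  # address -> live allocation (stand-in for malloc/free)
_next_addr = 0x1000


def sched_init(fd_count=8):
    """Allocate an iomap and publish its base pointer in slot 12."""
    global _next_addr
    addr = _next_addr
    _next_addr += 0x1000
    HEAP[addr] = [0] * fd_count
    SCHED[IOMAP_SLOT] = addr


def sched_shutdown(zero_slot=True):
    """Free the iomap; zero_slot=False reproduces the pre-fix behaviour."""
    addr = SCHED[IOMAP_SLOT]
    if addr:
        del HEAP[addr]                 # models free(%io.map.raw)
        if zero_slot:
            SCHED[IOMAP_SLOT] = 0      # models the store added by the fix


def io_wake(fd):
    """Wake an io-suspended peer, guarded like the fixed try_send/channel_close."""
    base = SCHED[IOMAP_SLOT]
    if base == 0:                      # null guard: sync code, or after shutdown
        return False
    iomap = HEAP[base]                 # a KeyError here models a use-after-free
    waiter, iomap[fd] = iomap[fd], 0   # models the xchg on base + fd*8
    return waiter != 0
```

With `zero_slot=False`, calling `io_wake` after `sched_shutdown` raises `KeyError`, which is the model's analogue of channel_close chasing the freed pointer.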

Impact: unlocked concurrency_spec.qz (57 tests) and channel_result_spec.qz (6 tests), plus async_io_spec.qz (6 tests) and scheduler_spec.qz (3/4; test 4 is a documented pre-existing flake). 72 tests newly passing, with zero regressions in the existing async/closure/concurrency specs.

`38c88277` — regression tests + golden refresh

Adds three new tests to async_spill_regression_spec.qz:

select-send on empty buffered channel from sync code
select-send with default arm picks send when ready
channel_close after sched_shutdown

And refreshes the Linux golden binary. Total 11 regression tests, all green. ROADMAP updated to reflect the two new “Known Bugs → Fixed” entries and move concurrency_spec / channel_result_spec / async_io_spec / scheduler_spec off the “Pre-existing failures still open” list (leaving only backpressure_spec 1/7 and tls_async_spec 0/6, both confirmed pre-existing on the pre-fix golden).

What was attempted and reverted

Four commits (c43a5416, b6c929ad, 20cf7fb2, 54a771ee) were made and then discarded via git reset --hard 38c88277. In order:

c43a5416: Call/vec/tuple spill — attempted to extend session 3’s binary-op spill/reload fix (3440903f) to also cover:
- NODE_CALL argument lists (f("pre", await h, "post"))
- NODE_ARRAY literals ([1, await h, 3])
- NODE_TUPLE_LITERAL ((7, await h, 99))
All three crash llc with “Instruction does not dominate all uses” pre-fix, so the compiler-emission bug is real and waiting to be fixed.

Why it was reverted: the fix’s binary broke actor_spec.qz compilation. The compiler emitted truncated IR (8370 lines instead of ~63440), stopping mid-function at define i64 @ with no function name. I root-caused it partway (a default-arg slot in arg_nodes carried a sentinel value that wasn’t a valid AST node ID; my new mir_any_sibling_contains_await helper walked into ast_get_kind with a garbage index → size-37 OOB) but even after adding a defensive bounds check (node >= ast_node_count(s)), the rebuilt compiler still produced truncated IR for actor_spec specifically.

The root cause is still open. Actor codegen emits __Future_<pointer>$new and __Future_<pointer>$poll functions whose name contains a raw pointer value — that’s weird enough to warrant its own investigation.
b6c929ad: Call/vec/tuple regression tests — depended on c43a5416.
20cf7fb2: Zero completion_map + task_locals slots in sched_shutdown — same bug pattern as the iomap fix in b86bde04. After sched_shutdown, spawn+await hung forever because @__qz_completion_map[3] was freed but not nulled, and the await logic uses that slot as a “scheduler active” proxy.

I verified the fix worked in isolation (a minimal sched_init + go + await + sched_shutdown + spawn + await hung pre-fix, ran clean post-fix). But the binary that tested it was built atop the broken c43a5416 binary, so the rebuild cycle tangled.

The completion_map source fix is worth re-applying in a fresh session. It’s a ~20-line diff to codegen_runtime.qz that adds store i64 0, ptr ... after each free(...) for @__qz_completion_map[0/1/2/3] and @__qz_task_locals[0/1/2/3]. The patch is reproduced under suggested target A below.
54a771ee: ROADMAP — depended on 20cf7fb2.

Why the rebuild got tangled

The Linux binary at self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden is the ONLY working Linux compiler — macOS binaries don’t help and there’s no alternative build path. Every compiler source edit requires rebuilding via that golden, which then OVERWRITES the golden with the rebuilt binary (the whole point is to verify fixpoint).

If the rebuilt binary has a subtle bug that only surfaces when compiling certain programs (like actor_spec), the bug is hidden by the fixpoint check (which only compares self-compilation output, not arbitrary-program output). And once you’re on the bad binary, rebuilding from source doesn’t fix you unless you have a known-good older golden in hand — which, because I kept overwriting it, I ultimately had to recover from git history (git show 38c88277:self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden).

Lesson for next session:

Before any compiler change, save a copy of the current golden under a fix-specific name (quartz-pre-<fix-name>-golden) and do NOT overwrite that copy under any circumstances.
After the rebuild, run quake fixpoint AND compile a few non-self programs (actor_spec.qz, concurrency_spec.qz, stream_skip_while-style generator tests) to catch program-specific codegen regressions that fixpoint misses.
If the rebuild fails those smoke tests, restore the saved pre-fix golden IMMEDIATELY — do not iterate on fixes from a suspect binary.
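
The three rules above can be scripted so they are harder to skip. The following is a minimal Python sketch under stated assumptions: the helper names are hypothetical, and the smoke-test commands are passed in by the caller because the exact compiler invocations are not recorded in these notes. Only the `quartz-pre-<fix-name>-golden` naming comes from the text.

```python
import shutil
import subprocess
from pathlib import Path


def save_golden(golden: Path, fix_name: str) -> Path:
    """Copy the golden to quartz-pre-<fix_name>-golden beside it; never overwrite."""
    backup = golden.with_name(f"quartz-pre-{fix_name}-golden")
    if backup.exists():
        raise FileExistsError(f"{backup} already exists; refusing to overwrite")
    shutil.copy2(golden, backup)   # copy2 keeps the mode bits, so it stays executable
    return backup


def smoke_test(commands: list[list[str]]) -> list[list[str]]:
    """Run each command (argv list) and return the ones that exit non-zero."""
    failed = []
    for argv in commands:
        if subprocess.run(argv).returncode != 0:
            failed.append(argv)
    return failed


def restore_golden(backup: Path, golden: Path) -> None:
    """Put the saved pre-fix golden back over a suspect rebuilt binary."""
    shutil.copy2(backup, golden)
```

A typical flow would be `save_golden(...)` before the edit, then `smoke_test([...])` with the fixpoint command and a few non-self compiles after the rebuild, then `restore_golden(...)` as soon as the returned list is non-empty.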

What I learned (deep insights worth carrying forward)

“Free without zero” is a recurring compiler pattern. Every global slot in @__qz_sched / @__qz_completion_map / @__qz_task_locals that holds a heap pointer has the same latent UAF: freeing without zeroing leaves the slot looking “active” to the ptr != 0 proxy that various codegen sites use. The iomap fix only closed one hole; completion_map and task_locals have the same shape (still unfixed).
The “active” proxy check itself is fragile. Using @__qz_sched[12] != 0 as “is the scheduler running?” is cute but brittle — it relies on teardown code being disciplined about zeroing. A dedicated @__qz_sched_active: i64 that sched_init/sched_shutdown toggle explicitly would be more robust but touches every check site. Worth doing eventually.
Default-arg placeholders in arg_nodes look like AST node IDs but aren’t. When na_default_mask != 0, the compiler leaves sentinel values in the children vec for skipped slots. Any traversal that walks children without consulting the mask will feed garbage to ast_get_kind → size-N array OOB. My mir_expr_contains_await walker needed a bounds check (node >= ast_node_count(s)) as defense in depth — that’s the right fix regardless of the default-mask issue.
__Future_<raw_pointer>$new/$poll names are a codegen fossil. actor_spec emits functions named with literal pointer values (e.g. @__Future_1082221776$new). Every build produces different pointer values, making the IR non-deterministic. This smells like actor/future construction using a pointer-as-identity hash instead of a stable identifier. The fact that my compiler broke on THIS specific file suggests the pointer-name path interacts badly with something in my attempted fix — and also that the whole pointer-naming approach is technical debt worth cleaning up.
scheduler_spec test 4 is flaky, failing about 60% of the time. I ran it 10+ times and it passes ~4/10. It’s the “pipe-based async task wakes via I/O poller” test that earlier handoffs called “1/4 flaky — OOM-killed exit 137”. Environment-dependent timing.
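
The default-arg insight above is easiest to see on a toy AST. The sketch below is Python, not compiler source: the `Ast` layout, the sentinel value, and the assumption that bit i of the mask marks child slot i as a placeholder are all illustrative. It shows the two layers working together: consult the mask first, then bounds-check anyway.

```python
from dataclasses import dataclass, field

AWAIT = "await"


@dataclass
class Ast:
    kinds: list                                       # kinds[node_id] -> node kind
    children: dict = field(default_factory=dict)      # node_id -> list of child slots
    default_mask: dict = field(default_factory=dict)  # node_id -> bitmask of defaulted slots


def real_children(ast, node):
    """Yield only the child slots that hold valid AST node ids."""
    mask = ast.default_mask.get(node, 0)
    for i, child in enumerate(ast.children.get(node, [])):
        if (mask >> i) & 1:                    # assumed: bit i set means slot i is a placeholder
            continue
        if not 0 <= child < len(ast.kinds):    # defense in depth against sentinel values
            continue
        yield child


def contains_await(ast, node):
    """Return True if node or any real descendant is an await."""
    if ast.kinds[node] == AWAIT:
        return True
    return any(contains_await(ast, c) for c in real_children(ast, node))
```

The bounds check also rejects negative sentinels, which matters in any language where a negative index would silently read a valid but wrong element.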

Suggested next targets

A. Re-apply the completion_map fix cleanly (high ROI)

~20-line addition to codegen_runtime.qz. Purely additive: it only adds store i64 0, ptr %slot lines after existing free(...) calls. Fixes the post-shutdown spawn+await hang documented above. The regression test is straightforward:

def rs_spawn_after_shutdown(): Int
  sched_init(0)
  var h1 = go rs_double(5)
  var r1 = await h1
  sched_shutdown()
  var h2 = spawn rs_double(9)
  var r2 = await h2
  return r1 + r2
end

Patch:

# After each free(...) call in __qz_sched_shutdown, add:
codegen_util::cg_emit_line(out, "  store i64 0, ptr %cm.arr.p2")
codegen_util::cg_emit_line(out, "  %cm.cnt.p2 = getelementptr [4 x i64], [4 x i64]* @__qz_completion_map, i64 0, i64 1")
codegen_util::cg_emit_line(out, "  store i64 0, ptr %cm.cnt.p2")
codegen_util::cg_emit_line(out, "  %cm.cap.p2 = getelementptr [4 x i64], [4 x i64]* @__qz_completion_map, i64 0, i64 2")
codegen_util::cg_emit_line(out, "  store i64 0, ptr %cm.cap.p2")
# ... after the mutex free:
codegen_util::cg_emit_line(out, "  store i64 0, ptr %cm.mtx.p2")
# Same pattern for @__qz_task_locals slots 0/1/2/3

Full diff is in the stash (run git stash show -p and search for completion_map).

B. Call/vec/tuple spill fix — retry, with smoke tests

The pattern is right; my implementation had subtle issues. Next attempt:

Start with the vec/tuple portions only. Call-args has extra complications (default masks, sentinel values).
Skip the default-mask case entirely: if na_default_mask != 0, do not spill.
Keep the defensive bounds check I added to mir_expr_contains_await. That’s a good invariant regardless.
Smoke-test aggressively before committing. After the rebuild, compile:
- actor_spec.qz (full IR lines, not truncated)
- concurrency_spec.qz
- A file with generators + lambda args
- The whole regression spec suite
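
The "full IR lines, not truncated" check can be automated. This is a hedged Python sketch: the function name and the `min_lines` threshold are hypothetical, and it assumes textual LLVM IR in which each function body ends with a `}` on its own line. It flags the two symptoms seen in the failed rebuild: a `define` with no function name and a body that never closes.

```python
import re

_DEFINE = re.compile(r"^define\b")
_NAMELESS = re.compile(r"^define\b.*@\s*$")


def ir_truncation_problems(ir_text, min_lines=0):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    lines = ir_text.splitlines()
    if len(lines) < min_lines:
        problems.append(f"only {len(lines)} lines, expected at least {min_lines}")
    open_line = None                      # line number of a define still awaiting its '}'
    for n, line in enumerate(lines, start=1):
        if _DEFINE.match(line):
            if _NAMELESS.match(line):
                problems.append(f"line {n}: define with no function name")
            if open_line is not None:
                problems.append(f"line {open_line}: function body never closed")
            open_line = n
        elif line.rstrip() == "}":
            open_line = None
    if open_line is not None:
        problems.append(f"line {open_line}: function body never closed")
    return problems
```

For actor_spec.qz, a `min_lines` somewhat below the ~63440 lines reported above would have caught the 8370-line output immediately.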

C. Investigate the `__Future_<pointer>$<name>` naming

Search for where the __Future_ prefix is emitted with a raw pointer value in the suffix. Probably in actor/future lowering. Replace with a stable counter or content hash. Deterministic names unlock IR snapshot testing for actor code.
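
Before touching the lowering code, it may help to confirm which symbols are affected and whether they are the only source of IR drift between builds. The Python sketch below works on emitted IR text, not on the compiler: both function names are hypothetical, and the pattern matches any decimal suffix after `__Future_`, so a legitimate counter-based name would match too.

```python
import re
from itertools import count

_PTR_NAME = re.compile(r"@__Future_(\d+)\$(\w+)")


def pointer_named_futures(ir_text):
    """Return the sorted unique (numeric_suffix, method) pairs found in the IR."""
    return sorted(set(_PTR_NAME.findall(ir_text)))


def normalize_future_names(ir_text):
    """Rewrite each distinct numeric suffix to a counter, in order of first appearance."""
    ids = {}
    counter = count()

    def repl(match):
        key = match.group(1)
        if key not in ids:
            ids[key] = next(counter)
        return f"@__Future_{ids[key]}${match.group(2)}"

    return _PTR_NAME.sub(repl, ir_text)
```

If the normalized IR from two builds of the same source is identical, the pointer-valued names are the only non-determinism and a stable counter in the compiler should be enough. This is a diagnostic aid only; the real fix still belongs in the lowering pass.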

D. Task-local storage spec (`tls_async_spec`) 0/6

0/6 is a wall, not a flake. Worth a real investigation — could be another free-without-zero or a genuine missing feature. Start by reading the spec and one of its failing tests to understand what’s expected.

Known flaky or pre-existing (NOT tonight’s regression)

Spec	State	Note
`scheduler_spec.qz` test 4	~40% pass	“pipe-based async task wakes via I/O poller” — environment timing
`backpressure_spec.qz`	1/7	Compiles now (improvement from pre-fix golden), but policy tests fail
`tls_async_spec.qz`	0/6	Task-local storage — pre-existing, reproduces on old golden
`actor_spec.qz`	21/21	Works on 38c88277 golden; any compiler edit must preserve this
`async_spill_regression_spec.qz`	11/11	The session 3 + session 4 regression file
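
To tell a flake from a regression, the pass rates in this table can be re-measured rather than eyeballed. A small Python sketch follows, assuming the spec can be run as a single command whose exit code reflects pass/fail. The command itself is not recorded in these notes, so it is supplied by the caller, and `pass_rate` is a hypothetical helper name.

```python
import subprocess


def pass_rate(argv, runs=10, timeout=None):
    """Run argv `runs` times and return the fraction of runs that exit with status 0."""
    passes = 0
    for _ in range(runs):
        try:
            ok = subprocess.run(argv, timeout=timeout).returncode == 0
        except subprocess.TimeoutExpired:
            ok = False                    # a hang counts as a failure
        passes += ok
    return passes / runs
```

A process killed by SIGKILL is reported by a shell as exit 137, while Python's `subprocess` reports it as return code -9. Both count as failures here.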

How to resume

cd /home/mathisto/projects/quartz-git
git log --oneline trunk -3   # 38c88277 should be at top

# Recover a working Linux golden from git (in case backups/ is contaminated)
git show 38c88277:self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden \
  > /tmp/q_known_good && chmod +x /tmp/q_known_good
/tmp/q_known_good --version   # expect: quartz 5.12.21-alpha

# Before starting any compiler edit, save the current golden:
cp self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden \
   self-hosted/bin/backups/quartz-pre-<fix-name>-golden

# DO NOT overwrite quartz-pre-<fix-name>-golden until the fix is committed.

The worktree at /home/mathisto/projects/quartz-head/ was heavily tangled during debugging and contains uncommitted bootstrap stubs plus misc files. Don’t trust it — reset hard to a clean state before using it:

cd /home/mathisto/projects/quartz-head
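git stash list        # check first: the completion_map diff for target A may be in a stash
git stash drop        # drops only the most recent stash; repeat for each session 4 stash
git restore --source=HEAD --worktree .   # restores tracked files only; untracked files remain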

Then re-apply the bootstrap stubs from the main repo as needed (the ast_func_is_cconv_c references in mir_lower.qz and resolver.qz need to be if false # bootstrap stub: ... for the Linux golden to compile the main-repo source).