Overnight Session Summary — Ready for macOS merge
Started: trunk at 96ad7a95 (session 3 handoff)
Ended: trunk at cbc8274a — 7 commits ahead of origin/trunk, unpushed
cbc8274a ROADMAP: record channel_new urem fix + remove backpressure_spec from open list
020e8194 Channel capacity: use exact user-requested cap, not next power of 2
cf0da41d ROADMAP: record completion_map fix + document actor_spec non-determinism
73dd0479 Zero completion_map + task_locals slots in __qz_sched_shutdown
48a94793 Session 4 handoff: iomap fix wins + rebuild-cycle lessons
38c88277 Regression tests for iomap fixes + refresh Linux golden binary
b86bde04 Guard iomap access in channel codegen: sync-context crash + sched_shutdown UAF
Headline numbers
- 4 real compiler bugs fixed (all in concurrency/channel codegen)
- ~85 newly-passing tests unlocked in spec suite
- Trunk is source-clean — no stray diffs, no broken files
- Regression coverage added: async_spill_regression_spec grows from 8 tests (session 3 end) to 12 tests (session 4)
- Linux golden at self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden stays at 38c88277-era state: intentionally NOT rebuilt due to a pre-existing Linux non-determinism bug (see §“Known caveat” below). The macOS golden (self-hosted/bin/quartz) should pick up all source fixes on the next quake build you run on macOS.
The 4 bugs fixed
1. b86bde04 — iomap null + dangling guard (try_send / channel_close / sched_shutdown)
Three related bugs in channel codegen, all involving @__qz_sched[12] (the scheduler’s iomap base pointer):
- try_send in pure-synchronous code crashed on the first successful send because it unconditionally dereferenced @__qz_sched[12] to wake io-suspended peers. In sync code the scheduler never initializes, so slot 12 is zero → null+fd*8 crash.
- channel_close had the same pattern at a different label prefix.
- sched_shutdown freed the iomap via free(...) but never zeroed slot 12, so a subsequent channel_close chased the freed pointer.
Unlocked: concurrency_spec (57 tests, was SEGV-at-startup), channel_result_spec (6 tests, was SEGV-at-startup), async_io_spec (6 tests, was silent hang), scheduler_spec (3/4 — test 4 pre-existing flake). 72 tests newly passing.
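The guard can be pictured with a small Python toy model. This is only a sketch of the idea, not the emitted LLVM IR: the list standing in for @__qz_sched, the dict standing in for the iomap, and the names wake_io_peer and IOMAP_SLOT are all hypothetical. Only the slot number 12 comes from the description above.

```python
# Toy model only: a Python list stands in for @__qz_sched and a dict for the
# iomap. Function and constant names are hypothetical, chosen for illustration.
IOMAP_SLOT = 12


def wake_io_peer(sched, fd):
    """Wake the waiter registered for fd, but only if an iomap exists."""
    iomap = sched[IOMAP_SLOT]
    if iomap is None:
        # Sync code: the scheduler never initialized, so there is nobody to wake.
        return False
    waiter = iomap.get(fd)
    if waiter is None:
        return False
    waiter["woken"] = True
    return True


# A scheduler table that was never initialized (pure-synchronous program).
sync_sched = [None] * 16
# A scheduler table with one io-suspended waiter on fd 3.
live_sched = [None] * 16
live_sched[IOMAP_SLOT] = {3: {"woken": False}}

print(wake_io_peer(sync_sched, 3))  # False: guarded, no dereference of an empty slot
print(wake_io_peer(live_sched, 3))  # True: waiter found and woken
```

The unguarded version corresponds to indexing the slot's contents without the `is None` check, which is the step that crashed in sync code.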
2. 73dd0479 — completion_map + task_locals zero-after-free in sched_shutdown
Same bug shape as the iomap fix. After sched_shutdown, @__qz_completion_map[3] (mutex pointer used as a “scheduler active” proxy) was freed but not nulled. Subsequent spawn + await took the scheduler-backed completion-pipe path on a dead scheduler and hung forever.
Fix: zero slots 0/1/2/3 of both @__qz_completion_map and @__qz_task_locals after their respective free(...) calls in __qz_sched_shutdown.
Unlocked: post-shutdown spawn + await works cleanly. Regression test rs_spawn_after_shutdown (hangs pre-fix, returns 14 post-fix).
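Why a freed-but-not-zeroed slot leads to a hang can be sketched with another Python toy model. Everything here is an assumption made for illustration: the lists stand in for the global slot arrays, the string "dangling" stands in for a freed pointer, and the "fallback" label is a placeholder for whatever non-scheduler path the runtime actually takes. Only the role of slot 3 as the proxy and the zeroing of slots 0 to 3 come from the text above.

```python
# Toy model only: lists stand in for @__qz_completion_map / @__qz_task_locals.
MUTEX_SLOT = 3  # slot described above as the "scheduler active" proxy


def make_table():
    """Four live 'pointers', one per slot (placeholder values)."""
    return ["p0", "p1", "p2", "mutex"]


def shutdown(table, zero_slots):
    """Model free(...) on slots 0..3, optionally clearing them afterwards."""
    for i in range(4):
        # Without zeroing, the slot still holds a stale non-null value.
        table[i] = None if zero_slots else "dangling"


def await_path(completion_map):
    """Choose the await strategy from the proxy check on slot 3."""
    if completion_map[MUTEX_SLOT] is not None:
        return "scheduler"  # completion-pipe path; hangs if the scheduler is dead
    return "fallback"       # hypothetical name for the non-scheduler path


pre_fix = make_table()
shutdown(pre_fix, zero_slots=False)
print(await_path(pre_fix))   # scheduler (stale proxy, dead scheduler)

post_fix = make_table()
shutdown(post_fix, zero_slots=True)
print(await_path(post_fix))  # fallback
```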
3. 020e8194 — Channel capacity uses exact user-requested cap (not next power of 2)
channel_new(cap) silently rounded capacity up to the next power of 2 so ring-buffer indexing could use a fast bitwise mask (and count, (cap-1)). User-visible consequences:
- channel_new(10) actually held 16 items before blocking (!)
- channel_pressure(ch) on a half-full channel_new(10) reported 31% instead of 50%
- channel_remaining(ch) reported 11 instead of 5
Fix: remove the bit-smearing round-up from channel_new and switch all 8 ring-buffer indexing sites from and count, (cap-1) to urem count, cap. LLVM's urem instruction is slower than the and instruction in raw cycles, but both the send and recv paths already involve pthread_mutex_lock, so the perf impact is in the noise.
Unlocked: backpressure_spec (7/7 passing, was 1/7). Also fixes the subtle “channels hold more than their declared capacity” semantic bug that was never noticed until the backpressure tests exposed it.
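The numbers above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, not the compiler's codegen: the helper names are hypothetical, and truncating integer division for the pressure percentage is an assumption that happens to match the reported 31%.

```python
def next_pow2(n):
    """Smallest power of two >= n, for n >= 1 (the old round-up behaviour)."""
    return 1 << (n - 1).bit_length()


def pressure_pct(count, cap):
    """Fill level as a whole percentage (truncating division is an assumption)."""
    return count * 100 // cap


def remaining(count, cap):
    return cap - count


def slot_mask(counter, cap):
    """Old indexing: only correct when cap is a power of two."""
    return counter & (cap - 1)


def slot_mod(counter, cap):
    """New indexing: unsigned remainder, correct for any cap >= 1."""
    return counter % cap


requested = 10
rounded = next_pow2(requested)
items = 5  # "half-full" relative to the requested capacity

print(rounded, pressure_pct(items, rounded), remaining(items, rounded))        # 16 31 11
print(requested, pressure_pct(items, requested), remaining(items, requested))  # 10 50 5

# With an exact capacity of 10, the mask reaches only a few slots,
# while the remainder reaches all ten.
print(sorted({slot_mask(i, requested) for i in range(requested)}))  # [0, 1, 8, 9]
print(sorted({slot_mod(i, requested) for i in range(requested)}))
```

The last two lines show why the round-up could not simply be dropped while keeping the mask: the and-based index is only a valid ring-buffer index when the capacity is a power of two.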
4. Cumulative: no regressions in existing concurrency/async specs
After all three fixes, 130 tests across 8 key specs are green on a freshly rebuilt Linux compiler:
concurrency_spec: 57/57
channel_result_spec: 6/6
backpressure_spec: 7/7 ← newly unlocked
async_spill_regression_spec: 12/12
async_channel_spec: 6/6
closure_capture_spec: 24/24
async_mutex_spec: 8/8
task_group_spec: 10/10
What’s in the regression spec now
spec/qspec/async_spill_regression_spec.qz protects every fix from session 3 AND session 4:
| Fix | Commit | Tests |
|---|---|---|
| Binary-op async spill/reload | 3440903f (session 3) | rs_string_concat_across_await, rs_literals_bracket_await, rs_arith_chain, rs_nested_await |
| MIR_AWAIT double-use UAF | e2f829fd (session 3) | rs_await_in_while_cond, rs_await_reuse |
| Closure capture walker for async | 73a14d56 (session 3) | rs_lambda_captures_handle, rs_zero_arg_lambda |
| try_send/channel_close iomap guard | b86bde04 (session 4) | rs_select_send_ready, rs_select_send_multi |
| iomap dangling after sched_shutdown | b86bde04 (session 4) | rs_close_after_shutdown |
| completion_map dangling after sched_shutdown | 73dd0479 (session 4) | rs_spawn_after_shutdown |
12 tests total, all green in ~6ms. Run this as a smoke test after any channel/async codegen touch.
Known caveat for Linux rebuild (not a blocker for macOS)
While refreshing the Linux golden binary, I discovered a pre-existing non-determinism bug in actor codegen:
- actor_spec.qz exercises actor classes (actor Counter, etc.).
- The compiler emits __Future_<name>$new and __Future_<name>$poll functions for each actor, where <name> is derived from a string stored in _async_func_names via as_string(...).
- On the 38c88277-era Linux golden, the string resolves correctly and you get deterministic names like __Future_Counter$__poll$new.
- On any freshly-rebuilt Linux binary, the same source resolves to a raw pointer value like __Future_369251536$new, which differs across runs, causing the IR to truncate mid-function with “define i64 @” (empty name) and llc to reject it.
The bug is environmental (Linux rebuild only) and pre-existing (reproduces without any of my session 4 changes). I did not refresh the Linux golden in this session to avoid regressing actor_spec.qz from 21/21 to 0/21.
Why this is fine for you: the canonical binary per CLAUDE.md is the macOS arm64 self-hosted/bin/quartz. When you run quake build on macOS, it’ll compile from the same (now-fixed) source and produce a fresh macOS golden. The Linux non-determinism is a separate bug specific to this machine’s bootstrap chain and can be investigated in its own session.
If you hit it on macOS (you shouldn’t), the rescue is git show 38c88277:self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden > /tmp/rescue && chmod +x /tmp/rescue — that’s the known-good binary from before the non-determinism regression.
Other work notes
- Session 4 handoff doc is at HANDOFF_SESSION_4.md. It describes the rebuild-cycle debugging that led to this summary.
- Session 3 handoff doc is at HANDOFF_CONCURRENCY_FIXES.md.
- Linux bootstrap handoff is at HANDOFF_LINUX_BOOTSTRAP.md.
- ROADMAP at docs/Roadmap/ROADMAP.md: the Known Bugs → Fixed section now has 5 session-4 entries (the 4 fixes above plus the Apr 11 session 3 entries).
- Worktree at /home/mathisto/projects/quartz-head/ was used for all Linux builds this session and is in a somewhat tangled state (bootstrap stubs + partial source overrides). Do not trust its state; reset it with git restore --source=HEAD --worktree . if you ever come back to it.
How to proceed on macOS
```sh
cd /path/to/your/quartz/checkout # (macOS working copy)
git fetch # pull tonight's 7 commits
git log --oneline origin/trunk -10 # confirm you see cbc8274a at top
git merge origin/trunk # or rebase, whichever your flow uses

# Rebuild macOS golden with the fixes:
./self-hosted/bin/quake build
./self-hosted/bin/quake fixpoint

# Smoke tests — all should be green:
./self-hosted/bin/quartz spec/qspec/async_spill_regression_spec.qz | llc -filetype=obj -o /tmp/asr.o
clang /tmp/asr.o -o /tmp/asr -lm -lpthread
/tmp/asr # expect: 12 tests, 12 passed

# Key wins to verify:
# 1. select { send(ch, 42) => 0 end } in sync code no longer crashes
# 2. channel_new(10) holds exactly 10 items
# 3. channel_pressure/channel_remaining return correct logical values
# 4. sched_init + go + await + sched_shutdown + spawn + await no longer hangs
# 5. backpressure_spec: 7/7 (was 1/7)
```
If macOS rebuild succeeds + fixpoint holds + regression spec is 12/12, you’re good to develop on top of these fixes.
What I’d touch next (handoff targets)
If you want to keep pushing in a future session, the highest-ROI items:
- Investigate the __Future_<pointer>$<name> Linux non-determinism. The value stored in _async_func_names[i] is a String but resolves to a raw int during interpolation on rebuilt Linux binaries. The root cause is probably in how as_string / existential-type re-tagging interacts with binary layout at certain addresses. Fix would unblock clean Linux rebuilds forever.
- tls_async_spec (0/6) is actually TLS networking, not task-local storage: it depends on OpenSSL and real sockets. Not a compiler bug; decide whether it’s in scope for the test suite or should be gated behind a feature flag.
- Audit __qz_sched_shutdown for any remaining free-without-zero patterns. The iomap, completion_map, and task_locals slots are now clean, but other slots (worker state, priority queues at slots 20/24/28, the timer_peers) could have the same shape. Same zero-after-free discipline applies.
Good morning. Everything is committed, trunk is clean, and the work is ready to merge back to your macOS main.