Handoff: Channel Throughput Optimization (15.6M → 25M+ msgs/s)
Handoff Prompt
Copy this into a fresh Claude Code session:
Handoff: Channel Throughput — Power-of-2 Indexing + Conditional Signal + try_send Wake Fix + Sampled Timing
Context
Quartz is a self-hosted systems language with an M:N work-stealing scheduler. We identified the real channel-throughput bottlenecks through instruction-level analysis, after ruling out three suspects: lock overhead (macOS pthread_mutex already uses os_unfair_lock, ~5ns), a Vyukov MPMC queue (cap=1 sequence collision; go-tasks use try_send, not inline send), and a CAS spinlock (zero improvement over os_unfair_lock).
The #1 bottleneck is `urem i64` (64-bit integer division) for ring-buffer indexing: 25-40 cycles (~8-13ns) on each send and each recv, i.e. 16-26ns per message. This accounts for 60% of the gap to Go (28M msgs/s). Go uses power-of-2 buffer sizes with a bitwise AND; Quartz doesn't.
Current state: 15.6M msgs/s. Target: 22-25M msgs/s. Go: 28M msgs/s.
What to build
Implementation plan at: /Users/mathisto/.claude/plans/wiggly-discovering-toucan.md
4 independent optimizations, ordered by impact:
**Step 1: Power-of-2 capacity + AND indexing (~16-26ns savings)**
In `cg_intrinsic_concurrency.qz`: (a) Round up capacity to next power-of-2 in `channel_new` (guard: skip for cap=0 rendezvous). (b) Replace ALL 7 `urem i64 %x, %cap` with `%mask = sub i64 %cap, 1; %mod = and i64 %x, %mask`.
**Step 2: Conditional condvar signal (~3-5ns savings)**
In `cg_intrinsic_concurrency.qz`: Only call `pthread_cond_signal` when buffer transitions empty→non-empty (send) or full→non-full (recv). Skip signal when no waiters possible.
**Step 3: try_send recv_waiter wake (correctness fix)**
In `cg_intrinsic_concurrency.qz`: `try_send` never wakes parked async receivers (waiter slot at offset 168). Add the same wake pattern that blocking `send` uses (lines 2994-3024).
**Step 4: Sampled time accounting (~1-2ns amortized)**
In `codegen_runtime.qz`: Worker loop calls `time_mono_ns()` twice per poll. Sample every 64th poll instead, scale by 64.
Key files
- self-hosted/backend/cg_intrinsic_concurrency.qz — Steps 1, 2, 3 (channel_new, send, recv, try_send, try_recv)
- self-hosted/backend/codegen_runtime.qz — Step 4 (worker loop lines 1749-1769)
- tools/sched_bench.qz — Benchmark (read-only)
Verification after each step
1. `./self-hosted/bin/quake build`
2. `./self-hosted/bin/quake fixpoint`
3. Compile+run spec/qspec/concurrency_spec.qz natively — 40/57 pass
4. `./self-hosted/bin/quake sched_bench` — channel_throughput improvement
Current commit: c0c8da9 on the trunk branch.
See CLAUDE.md for build commands.
Prime Directives: World-class only. No shortcuts. No silent compromises. Fill every gap.
What Was Already Done (This Session)
Shipped (committed)
- add9e56: broadcast→signal (16 channel ops) + lock-free try_send/try_recv pre-check
- c0c8da9: Pre-check monotonic ordering + P36/P37 roadmap additions
Attempted and Reverted (with learnings)
- Vyukov MPMC queue: Cap=1 sequence collision (pos+1 == pos+cap). Go-tasks use try_send not inline send. 16-byte slots doubled memory. REVERTED.
- CAS spinlock: macOS pthread_mutex already uses os_unfair_lock (~5ns CAS). Zero measurable improvement. REVERTED.
- Key discovery: quartz_demo was running at 762% CPU during ALL initial benchmarks, invalidating results. Killed it; clean benchmarks confirmed baseline.
Key Architectural Insights
- `try_send` is already INLINE (intrinsic, not a function call) — no PLT overhead to eliminate
- The $poll state machine adds ~5-8ns overhead per try_send, but fixing this requires MIR-level changes (deferred to P36)
- Direct goroutine handoff (P37) requires waiter queue redesign (deferred, documented in CONCURRENCY_ROADMAP.md)
- The `urem i64` division is the #1 bottleneck — 60% of the gap to Go