Handoff: Channel Throughput Optimization (15.6M → 25M+ msgs/s)
Handoff Prompt
Copy this into a fresh Claude Code session:
Handoff: Channel Throughput — Power-of-2 Indexing + Conditional Signal + try_send Wake Fix + Sampled Timing
Context
Quartz is a self-hosted systems language with an M:N work-stealing scheduler. We identified the real channel-throughput bottlenecks through instruction-level analysis, after ruling out three suspects: lock overhead (macOS pthread_mutex already uses os_unfair_lock, ~5ns), a Vyukov MPMC queue (cap=1 sequence collision; go-tasks use try_send, not inline send), and a CAS spinlock (zero improvement over os_unfair_lock).
The #1 bottleneck is `urem i64` (64-bit integer division) for ring-buffer indexing: 25-40 cycles (~8-13ns) on each send and each recv, i.e. 16-26ns per message. This accounts for 60% of the gap to Go (28M msgs/s). Go uses power-of-2 buffer sizes with a bitwise AND; Quartz doesn't.
Current state: 15.6M msgs/s. Target: 22-25M msgs/s. Go: 28M msgs/s.
What to build
Implementation plan at: /Users/mathisto/.claude/plans/wiggly-discovering-toucan.md
4 independent optimizations, ordered by impact:
**Step 1: Power-of-2 capacity + AND indexing (~16-26ns savings)**
In `cg_intrinsic_concurrency.qz`: (a) Round up capacity to next power-of-2 in `channel_new` (guard: skip for cap=0 rendezvous). (b) Replace ALL 7 `urem i64 %x, %cap` with `%mask = sub i64 %cap, 1; %mod = and i64 %x, %mask`.
**Step 2: Conditional condvar signal (~3-5ns savings)**
In `cg_intrinsic_concurrency.qz`: Only call `pthread_cond_signal` when buffer transitions empty→non-empty (send) or full→non-full (recv). Skip signal when no waiters possible.
**Step 3: try_send recv_waiter wake (correctness fix)**
In `cg_intrinsic_concurrency.qz`: `try_send` never wakes parked async receivers (waiter slot at offset 168). Add the same wake pattern that blocking `send` uses (lines 2994-3024).
**Step 4: Sampled time accounting (~1-2ns amortized)**
In `codegen_runtime.qz`: Worker loop calls `time_mono_ns()` twice per poll. Sample every 64th poll instead, scale by 64.
Key files
- self-hosted/backend/cg_intrinsic_concurrency.qz — Steps 1, 2, 3 (channel_new, send, recv, try_send, try_recv)
- self-hosted/backend/codegen_runtime.qz — Step 4 (worker loop lines 1749-1769)
- tools/sched_bench.qz — Benchmark (read-only)
Verification after each step
1. `./self-hosted/bin/quake build`
2. `./self-hosted/bin/quake fixpoint`
3. Compile+run spec/qspec/concurrency_spec.qz natively — 40/57 pass
4. `./self-hosted/bin/quake sched_bench` — channel_throughput improvement
Current commit: c0c8da9 on the trunk branch.
See CLAUDE.md for build commands.
Prime Directives: World-class only. No shortcuts. No silent compromises. Fill every gap.
What Was Already Done (This Session)
Shipped (committed)
- add9e56: broadcast→signal (16 channel ops) + lock-free try_send/try_recv pre-check
- c0c8da9: Pre-check monotonic ordering + P36/P37 roadmap additions
Attempted and Reverted (with learnings)
- Vyukov MPMC queue: Cap=1 sequence collision (pos+1 == pos+cap). Go-tasks use try_send not inline send. 16-byte slots doubled memory. REVERTED.
- CAS spinlock: macOS pthread_mutex already uses os_unfair_lock (~5ns CAS). Zero measurable improvement. REVERTED.
- Key discovery: quartz_demo was running at 762% CPU during ALL initial benchmarks, invalidating results. Killed it; clean benchmarks confirmed baseline.
Key Architectural Insights
- `try_send` is already INLINE (intrinsic, not a function call) — no PLT overhead to eliminate
- The $poll state machine adds ~5-8ns overhead per try_send, but fixing this requires MIR-level changes (deferred to P36)
- Direct goroutine handoff (P37) requires waiter queue redesign (deferred, documented in CONCURRENCY_ROADMAP.md)
- The `urem i64` division is the #1 bottleneck — 60% of the gap to Go