# Next Session — Compiler Memory Optimization Phase 3

**Baseline:** ca2f64fc (PSQ-8 closed, 2288 functions, fixpoint verified, smoke 4/4 + 22/22 green)
**Target:** macOS self-compile 25 GB → <8 GB peak RSS (3.1× reduction, working target)
**Scope:** Multi-session — three sub-phases, each ~0.5–1.5 quartz sessions
**Prime directive:** #1 (highest impact) — this is a foundational developer-experience win that makes self-compile viable on 16 GB machines again
## Why this is the next big chunk

The roadmap has four candidates for the “next big chunk”:
| Option | Effort | Impact | Why not |
|---|---|---|---|
| Scheduler park/wake refactor (#13) | 2–3d | Closes 4 specs + enables #15 async locks | Existing handoff in docs/HANDOFF_PRIORITY_SPRINT.md; smaller chunk |
| Package Manager (#21) | 2–3w | “THE post-launch priority” | Big domain shift away from compiler work; user’s recent momentum is compiler polish |
| TGT.3 Direct WASM Backend | 10 phases | Canvas Site demo, browser target | Requires upfront research sprint; higher risk of wasted effort without a plan |
| Compiler Memory Phase 3 | 1–2w | 25 GB → <8 GB on macOS, 2–3× faster self-compile | — this is the pick |
Why Phase 3:

- **Real problem, real data.** Verified Apr 15, 2026 (after PSQ-8 landed): `quartz --memory-stats` on a full self-compile shows 25,229 MB peak RSS on macOS. The Phase 2 number of 12.7 GB was from the mimalloc-linked build, but mimalloc was disabled on macOS in Batch A (Apr 14) due to the dyld TLS-slot race (see `docs/PLATFORMS.md`). The user runs Apple Silicon. Every self-compile is 35 seconds of swap-thrashing on a 16 GB machine.
- **Concrete phase-by-phase breakdown already in hand.** Running `--memory-stats` on trunk HEAD:

  ```
  [mem] init:              1 MB current (0ms)
  [mem] lex:               3 MB (+2) (1ms)
  [mem] parse:            42 MB (+39) (13ms)
  [mem] resolve_pass1: 10462 MB, 82 modules
  [mem] resolve_pass2: 10495 MB
  [mem] resolve:       10498 MB (+10456) (2608ms)
  [mem] typecheck:     17028 MB (+6530) (11521ms)
  [mem] tc_free:       17028 MB (+0) (11521ms)
  [mem] mir:           24485 MB (+7457) (32342ms)
  [mem] codegen:       25229 MB (+744) (35579ms)
  ```

  The three big spenders: resolve (+10.4 GB, largest), mir (+7.5 GB, second), typecheck (+6.5 GB, third). Codegen is already cheap (+744 MB). `tc_free` is a no-op despite its name — not actually freeing anything.
- **Single-file experiment confirms the parser hot spot.** Compiling `self-hosted/frontend/lexer.qz` alone (1799 LOC, 4 modules imported) burns 1977 MB in the parse phase. That’s 1.1 MB per source line, which is pathological. Something in the parser (or the resolver, for the imports) grows O(n²) or worse.
- **Already-calibrated gains from Phase 1+2.** Phase 1 was `tc_mangle` intmap caching (eliminated 12M allocations in typecheck). Phase 2 added 33 more cache sites across MIR and resolver. The same pattern should work for parse/resolve, plus we have the phase-based freeing lever that Phase 1+2 didn’t touch.
- **Unblocks everything.** A 3× faster self-compile (35s → ~12s) is a 3× speedup on every compiler iteration. Over a 2-week sprint on any downstream feature (WASM backend, package manager, async locks), that’s hours of saved wall time. This is the highest-leverage quality-of-life improvement available.
## Phase 3a: Parser / resolve memory hot spot (~0.5–1 session)

**Target:** Reduce the resolve phase from 10,498 MB to ~2,000 MB. Reduce the single-file `lexer.qz` parse from 1977 MB to <200 MB.
### What we know

- Lex itself is cheap: 3 MB for all 82 self-hosted modules (the lexer produces token streams, not strings).
- Parse on a single file (`lexer.qz`, 1799 LOC) burns 1.9 GB. The single-file parse phase is the culprit, not module loading.
- Per-LOC cost: ~1.1 MB. Roadmap item #19 specifically flags this: “lexer.qz uses 2 GB for parsing alone (300 if-blocks in while loop)”. Note: the “300 if-blocks” hint is from project memory (6 days old) — verify against current `frontend/lexer.qz` and `frontend/parser.qz`.
- Resolve then scales that up to 10.5 GB across 82 modules. Resolve’s pass1 reads 10,462 MB; pass2 adds only 33 MB. So pass1 is the spender.
### What to investigate first

1. **Profile the parser on `lexer.qz` in isolation.** The memory hotspot is almost certainly an O(n²) loop in the parser accumulating state. Candidates ranked by likelihood:
   - String concatenation in token buffering — if each token appends to a growing string with `s + t`, that’s O(total bytes²).
   - Vec push with implicit reallocation — if Vec growth doubles without releasing old storage (unlikely given the `vec.c` implementation, but verify).
   - AST node slot-value strings — if every AST node allocates a fresh string for `str1`/`str2` instead of reusing interned symbols, multiply by ~30k nodes = 30k strings per module.
   - Nested if-elif chains building long arms — each arm’s AST potentially heap-allocated separately.
2. **Instrument the `ps_*` functions with allocation counters.** Before optimizing, measure. Add a cheap counter (static int) to each parse entry point, print on exit, and correlate with memory growth. Or just sprinkle `eputs("{pos=#{pos}, rss=#{mem_peak_rss() / 1048576} MB}")` at each major loop iteration in the suspected hot spot.
3. **Verify what resolve pass1 is doing.** Pass1 grows 10 GB across 82 modules — check whether it’s re-parsing modules or doing something expensive once per module. Look at `self-hosted/resolver.qz` for the loop structure. If it’s 2 GB per small module × 82 modules, the fix is parse-level. If it’s 100 MB per module × 82 with a long tail, it’s a different bug.
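To make the first candidate concrete, here is a small cost model in C (not Quartz; the function names are illustrative) showing why per-token `s + t` appending copies a quadratic number of bytes while an amortized doubling buffer stays linear:

```c
#include <assert.h>
#include <stddef.h>

/* Cost model only -- hypothetical names, not Quartz code.
 * Pattern 1: each token append copies the whole accumulated string
 * (the `s + t` suspect), so n one-byte appends copy O(n^2) bytes. */
size_t concat_copy_cost(size_t n_tokens) {
    size_t copied = 0;
    for (size_t len = 0; len < n_tokens; len++)
        copied += len + 1;            /* re-copy existing bytes + the new one */
    return copied;
}

/* Pattern 2: an amortized doubling buffer re-copies only on growth,
 * so total copying stays O(n). */
size_t buffer_copy_cost(size_t n_tokens) {
    size_t copied = 0, cap = 1;
    for (size_t len = 0; len < n_tokens; len++) {
        if (len == cap) { copied += len; cap *= 2; }  /* realloc-style copy */
        copied += 1;                                   /* append one byte */
    }
    return copied;
}
```

At 1000 tokens the quadratic pattern already copies ~250× more bytes than the buffer; over a whole source file, and with each intermediate string left live, that class of pattern could plausibly reach the observed ~2 GB.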
### Fix shape

Depends on what the profile reveals. Most likely one of:

- **Symbol interning at the parser level.** If every `IDENT` token allocates a new string, switch to a `Map<String, Int>` symbol table keyed by byte slice — return the existing ID on hit. rustc does this with `Symbol`; Zig does this with its intern pool.
- **String slicing instead of copying.** If the parser copies substrings out of the source buffer, switch to passing `(offset, length)` tuples. The source buffer outlives parse; we don’t need copies.
- **AST node pooling.** If 30k AST nodes × 16 slot fields × ~100 bytes per allocation = 48 MB per module × 82 = 4 GB, switch to slab allocation for AST nodes. rustc’s typed arenas are the reference.
- **A specific O(n²) loop.** If the profile fingers one function, fix it in place.
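As a reference for the interning option, a minimal sketch in C (the names `intern` and `g_syms` are assumptions for illustration; a real table would hash rather than linearly scan): repeated identifiers cost one lookup and zero allocations after the first sighting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interning sketch, not Quartz's actual API: every IDENT is
 * looked up once and represented by a u32 id afterwards. */
#define MAX_SYMS 4096
static char *g_syms[MAX_SYMS];
static uint32_t g_sym_count = 0;

uint32_t intern(const char *start, size_t len) {
    for (uint32_t i = 0; i < g_sym_count; i++) {  /* linear scan for brevity */
        if (strlen(g_syms[i]) == len &&
            memcmp(g_syms[i], start, len) == 0)
            return i;                             /* hit: no allocation */
    }
    assert(g_sym_count < MAX_SYMS);
    char *copy = malloc(len + 1);                 /* miss: copy exactly once */
    memcpy(copy, start, len);
    copy[len] = '\0';
    g_syms[g_sym_count] = copy;
    return g_sym_count++;
}
```

The `(start, len)` signature also composes with the string-slicing option: the parser can intern directly from `(offset, length)` into the source buffer without an intermediate copy.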
### Measurement

Before:

```
./self-hosted/bin/quartz --no-cache --memory-stats ... self-hosted/frontend/lexer.qz > /dev/null
```

- Expected before: `[mem] parse: 1977 MB (+1974)`
- Target after: `[mem] parse: <200 MB`
- Full self-compile before: 25,229 MB peak
- Target after 3a alone: ~15,000 MB peak (saving ~10 GB from the resolve phase by propagating the parse-level gain across 82 modules)
### Files to read

- `self-hosted/frontend/parser.qz` (7521 LOC)
- `self-hosted/frontend/lexer.qz` (1799 LOC)
- `self-hosted/frontend/ast.qz` (AST storage layout — slot allocation patterns)
- `self-hosted/resolver.qz` (resolve pass1 structure)
- `self-hosted/shared/string_intern.qz` (existing intern machinery we might extend)
### Risks

- Parser changes are high-blast-radius. Every compiler iteration touches parse. Run `quake guard` and `quake smoke` after every change, and the full `vec_element_type_spec` / `builtin_arity_spec` / `expand_node_audit_spec` sweep after each experiment.
- Interning changes can break hash stability or error-message source locations. Be careful about what you replace.
## Phase 3b: Phase-based AST/MIR freeing (~0.5–1 session)

**Target:** Reduce the mir phase peak from 24,485 MB to ~14,000 MB. Reduce the codegen peak from 25,229 MB to ~10,000 MB.
### The leverage
- After MIR is lowered, the AST is dead. Codegen reads MIR, not AST. We hold the AST alive through codegen for no reason.
- After codegen is done, MIR is dead. Nothing downstream reads it.
The phase-by-phase growth shows the AST is ~7.5 GB (the MIR delta) and the MIR is probably most of the codegen baseline. Dropping the AST after MIR lowering is worth ~7 GB peak reduction. Dropping MIR after codegen saves less (codegen is the last phase) but would matter if we ever add a post-codegen phase.
Of these two, freeing the AST after MIR is the clear win. It’s plausibly the single biggest memory lever in the entire compiler.
### Implementation shape

1. **Locate `AstStorage` ownership.** Probably one global or a field on `QuartzCompiler`. See the `self-hosted/quartz.qz` top-level pipeline (the driver — `main()` walks the phases) and `self-hosted/frontend/ast.qz` for the storage struct.
2. **Confirm MIR doesn’t back-reference the AST.** Grep for `ast_storage` and `ast_get_` calls inside `self-hosted/backend/`. After MIR lowering, these should all be zero. If any codegen function still calls `ast_get_str1()` on an AST handle, we need to finish the MIR lowering work before we can free — or lift those calls out.
3. **Free the AST at the phase boundary.** After `mir_lower::lower_program(...)` returns, call `ast_free(ast_storage)` (or equivalent — add the helper if it doesn’t exist). The AST storage is a giant vector of slots; freeing is just releasing the underlying bufs.
4. **Verify the memory-stat intrinsic reports the drop.** The stats output should show a new `mir_free` line (or similar) with a negative delta of several GB.
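The steps above reduce to one ownership pattern, sketched here in C under the assumption that MIR copies everything it needs out of the AST (the names `AstStorage`, `lower_program`, and `ast_free` mirror the plan’s vocabulary but are not Quartz’s real API):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative phase-boundary sketch: the driver owns the AST storage,
 * lowers it to MIR, then frees the AST before codegen ever runs. */
typedef struct { int *slots; size_t count; } AstStorage;
typedef struct { int *insts; size_t count; } Mir;

Mir lower_program(const AstStorage *ast) {
    Mir mir = { malloc(ast->count * sizeof(int)), ast->count };
    for (size_t i = 0; i < ast->count; i++)
        mir.insts[i] = ast->slots[i] * 2;  /* copies out: MIR keeps no AST pointers */
    return mir;
}

void ast_free(AstStorage *ast) {
    free(ast->slots);                      /* release the underlying buffer */
    ast->slots = NULL;
    ast->count = 0;
}
```

The safety of the `ast_free` call rests entirely on `lower_program` copying rather than aliasing, which is exactly why step 2 (grep for back-references) must precede step 3.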
### The tc_free mystery

The current memory-stats output shows:

```
[mem] typecheck: 17028 MB (+6530)
[mem] tc_free:   17028 MB (+0)    ← zero delta — nothing freed
```

`tc_free` is supposed to free typecheck state after typecheck completes, but it isn’t actually releasing any memory. Investigate: is `tc_free` actually running? Is it freeing something that was already freed? Is the 6.5 GB that typecheck added permanently reachable? This might be an independent leak worth fixing alongside.
### Measurement

Expected after 3b:

```
[mem] mir:     ~15000 MB (+7500, then -7500 for the AST free) or similar
[mem] codegen: ~15744 MB
```
### Files to read

- `self-hosted/quartz.qz` (compiler driver — phase sequencing)
- `self-hosted/frontend/ast.qz` (AST storage struct + free helper, if any)
- `self-hosted/middle/typecheck.qz` (search for `tc_free` to understand the current no-op behavior)
- `self-hosted/backend/mir_lower.qz` (entry point — where to add the `ast_free` call after it returns)
### Risks

- If MIR still holds AST pointers, freeing the AST causes use-after-free. This is why step 2 (grep for back-references) is mandatory before step 3.
- `tc_free` might be a no-op because the state is still referenced. Fixing `tc_free` requires tracing what holds onto `TypecheckState` after typecheck completes.
## Phase 3c: @cfg gating unblock (~0.5 session)

**Target:** Enable `@cfg(feature: "...")` gating on def-level items (not just import-level). Use it to gate egraph (~457 MB), lint (~103 MB), and any other dev-only modules out of a minimal self-compile.
### The blocker

Project memory (6 days old, verify):

> “More @cfg gating — egraph (457 MB), lint (103 MB) — blocked by @cfg-on-def SIGSEGV bug”

Somewhere, putting `@cfg(feature: "xxx")` above a def causes a SIGSEGV at parse, resolve, or typecheck time. The bug is the blocker; fix it first, then the gating pays off.
### Investigate

1. **Reproduce the SIGSEGV.** Write a minimal file:

   ```
   @cfg(feature: "xyz")
   def unused(): Int = 0

   def main(): Int = 0
   ```

   Compile with `quartz --no-cache /tmp/cfg.qz`. If it crashes, we have a minimal repro.
2. **Find where `@cfg` is processed.** Grep `parser.qz` and `resolver.qz` for `cfg` — probably an attribute handler. The handler likely assumes `@cfg` only appears on imports, not on defs.
3. **Fix the handler to support def-level gating.** Skip or elide the def when the feature is absent. This might require threading `g_cfg_features` through to the parser’s def handler.
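Once the handler is reachable from the def path, the gating decision itself is trivial. A hedged sketch in C of the intended semantics (the names `should_emit_def` and the enabled-feature list are assumptions, not the real handler):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative def-level gating: a def carrying @cfg(feature: "...") is
 * emitted only when that feature is in the enabled set. The resolver must
 * skip the def entirely (no dangling symbol), which is presumably the part
 * the current import-only handler gets wrong for defs. */
bool feature_enabled(const char *name, const char **enabled, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(enabled[i], name) == 0) return true;
    return false;
}

/* cfg_feature == NULL means the def has no @cfg attribute: always emit. */
bool should_emit_def(const char *cfg_feature, const char **enabled, size_t n) {
    return cfg_feature == NULL || feature_enabled(cfg_feature, enabled, n);
}
```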
### Payoff

Once fixed, gate the dev-only modules:

```
@cfg(feature: "egraph")
def egraph_optimize(...)
```

Run the `--memory-stats` pipeline with and without the feature flag to measure the savings. Expected: ~500–600 MB off the peak, shaved from MIR/typecheck.
### Risks

- `@cfg` parsing bugs can produce confusing errors. Start with a single def, not a large gate.
- Gating in the wrong place can break tests that depend on the gated module. Run the QSpec sweep after each gate addition.
## Session sequencing

**Session 1 — Phase 3a (parser hot spot)**

- Profile the `lexer.qz` single-file parse
- Identify the O(n²) culprit
- Fix in place
- Measure: single-file parse <200 MB, full self-compile <15 GB peak
- Commit with fixpoint verified

**Session 2 — Phase 3b (phase-based freeing)**

- Investigate `tc_free` (why is it a no-op?)
- Grep MIR/codegen for AST back-references
- Add the `ast_free` call after MIR lowering
- Measure: codegen phase <10 GB peak
- Commit with fixpoint verified

**Session 3 — Phase 3c (@cfg unblock + gating)**

- Minimal SIGSEGV repro
- Fix the parser/resolver handler
- Add `@cfg(feature: "egraph")` gates
- Measure: feature-off build shaves ~500 MB
- Commit with fixpoint verified
Total: ~3 quartz sessions (calibrated from 1–2 weeks traditional per roadmap estimate).
Sessions can run independently. 3a → 3b → 3c is the natural order but not required — 3c could slot in at any point.
## Research pointers (Directive 2 — design the full correct solution)

How do the big systems languages handle this problem?
- **rustc:** Arena allocation per compilation unit, with typed arenas for different node kinds; each phase drops its own arena. Symbol interning via `Symbol` — every `IDENT` is a `u32` index into a global table. `ThinVec<T>` for small collections to avoid header overhead. See rustc’s `arena.rs` and `symbol.rs`.
- **Go compiler:** Phase-based AST drops. The `types2` package explicitly releases `*Info` after typechecking. Go’s `parser.File` owns everything and is drop-once.
- **Zig:** Per-function arena. `Ast.zig` holds all source-level info in a flat array of `Node`; it is lowered to `Zir`, after which the AST can be freed. Symbol interning via `InternPool`.
- **LLVM/clang:** `BumpPtrAllocator` for the AST, dropped between TUs. Clang’s `ASTContext` owns everything; there’s no phase freeing within a TU, but TU boundaries are hard drops.
The common pattern: arena per phase, drop at phase boundary, intern strings once.
Quartz has none of this today. The AST lives in one big AstStorage vector that never frees. Strings are malloc’d one-by-one. There’s no symbol table beyond the typecheck builtin map.
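For reference, the arena-per-phase pattern those compilers share is small enough to sketch in C (illustrative only; Quartz would build something like this over `AstStorage`):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal bump arena: all nodes for a phase come from one buffer, and the
 * whole phase is released with a single free at the phase boundary. */
typedef struct { char *base; size_t cap, used; } Arena;

Arena arena_new(size_t cap) {
    Arena a = { malloc(cap), cap, 0 };
    return a;
}

void *arena_alloc(Arena *a, size_t size) {
    size = (size + 7) & ~(size_t)7;        /* round up to 8-byte alignment */
    if (a->used + size > a->cap) return NULL;
    void *p = a->base + a->used;           /* bump the cursor, no per-node header */
    a->used += size;
    return p;
}

void arena_drop(Arena *a) {                /* phase boundary: one free, O(1) */
    free(a->base);
    a->base = NULL;
    a->used = a->cap = 0;
}
```

The point of the design: allocation is a pointer bump, deallocation is one call per phase, and there is no per-node malloc header overhead, which is where the “16 slot fields × ~100 bytes” estimate above mostly goes.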
The world-class fix (not necessarily this sprint, but worth designing for):

- Slab-allocated AST nodes in a single arena buf.
- String interning for all identifiers and literals (shared backing store).
- Phase boundaries that drop the arena.

The pragmatic fix (this sprint):

- Find and fix the parse-phase O(n²) (Phase 3a).
- Add an `ast_free()` helper that walks the `AstStorage` vecs and releases them (Phase 3b).
- Fix `@cfg` and gate the dev modules (Phase 3c).
The pragmatic fix buys 60–70% of the win without a rewrite. The world-class fix is a future sprint after launch.
## Session-start checklist

```sh
cd /Users/mathisto/projects/quartz

# 1. Verify baseline
git log --oneline -6                          # should show ca2f64fc PSQ-8 at top
git status                                    # should be clean
./self-hosted/bin/quake guard:check           # "Fixpoint stamp valid"
./self-hosted/bin/quake smoke 2>&1 | tail -6  # 4/4 + 22/22

# 2. Read this document + the memory-opt project memory
cat docs/handoff/next-session-compiler-memory-phase3.md
cat /Users/mathisto/.claude/projects/-Users-mathisto-projects-quartz/memory/project_memory_optimization.md

# 3. Establish the baseline memory measurement
./self-hosted/bin/quartz --no-cache --memory-stats \
  -I self-hosted/frontend -I self-hosted/middle -I self-hosted/backend \
  -I self-hosted/shared -I std -I tools \
  self-hosted/quartz.qz > /dev/null
# Expected (from Apr 15 verification):
#   [mem] parse:     42 MB
#   [mem] resolve:   10498 MB (+10456)
#   [mem] typecheck: 17028 MB (+6530)
#   [mem] mir:       24485 MB (+7457)
#   [mem] codegen:   25229 MB (+744)

# 4. Single-file baseline (the parser hot spot)
./self-hosted/bin/quartz --no-cache --memory-stats \
  -I self-hosted/frontend -I self-hosted/middle -I self-hosted/backend \
  -I self-hosted/shared -I std -I tools \
  self-hosted/frontend/lexer.qz > /dev/null
# Expected:
#   [mem] parse: 1977 MB (+1974)   <- THE PARSER HOT SPOT

# 5. Fix-specific backup before touching anything
cp self-hosted/bin/quartz self-hosted/bin/backups/quartz-pre-mem3-golden
```
## Success criteria

**Minimum viable (ship-worthy):**

- Full self-compile peak RSS ≤ 15 GB on macOS (was 25 GB)
- `lexer.qz` single-file parse ≤ 500 MB (was 1977 MB)
- All smoke tests green
- All vec/struct/field/concurrency regression specs green
- Fixpoint verified at each commit

**Target (aggressive but plausible):**

- Full self-compile peak RSS ≤ 10 GB on macOS (60% reduction)
- `lexer.qz` single-file parse ≤ 200 MB (90% reduction)
- `--memory-stats` shows the AST dropped after MIR lowering
- `@cfg` gating works and is used

**Stretch (world-class):**

- ≤ 5 GB peak RSS via arena + interning (multi-sprint)
Each committed change must:

- Have the `quake guard` fixpoint verified (2288 ± 30 functions)
- Pass `quake smoke`
- Pass the vec/struct/field regression sweep (18+ specs)
- Pass the concurrency sweep (4+ specs)
- Include a before/after `--memory-stats` snapshot in the commit message body
## Prime directives reminder (v2, Apr 12 2026)

- **Highest impact, not easiest.** 25 GB → <10 GB is a 2.5× DX improvement. Every subsequent compiler iteration pays the dividend. Don’t settle for 20 GB just because the easy fix is a parser one-liner.
- **Research first.** Read up on rustc’s arenas, Zig’s arena, and Go’s phase drops. Don’t reinvent intern tables.
- **Pragmatism ≠ cowardice.** The pragmatic Phase 3 plan above is fine — we’re sequencing correctly. The “world-class” rewrite is a future sprint, not a shortcut.
- **Work spans sessions.** Don’t cram all three sub-phases into one session. Commit at phase boundaries. Hand off cleanly.
- **Report reality.** If the parser fix only gets 30% savings instead of 90%, report 30%. Don’t weasel.
- **Holes get filled or filed.** Phase 3 will surface other hotspots (`tc_free` is a no-op; resolve pass1 growth is not fully explained). File each one as its own row; don’t silent-discover.
- **Delete freely.** If the parse-phase fix makes the intmap cache layer redundant, delete the cache. If the `@cfg` handler has dead branches, delete them.
- **Binary discipline.** `quake guard` + a fix-specific backup before every compiler edit. Fix-specific names: `quartz-pre-mem3a-golden`, `quartz-pre-mem3b-golden`, `quartz-pre-mem3c-golden`.
- **Estimate honestly.** Traditional 1–2 weeks ÷ 4 = 2–5 quartz days. Sessions 1/2/3 should each be a day or less if we’re calibrated.
- **Corrections are calibration.** If the parser O(n²) theory turns out to be wrong (maybe it’s allocator overhead, not algorithmic), update the plan and move on.
## Out of scope for this sprint

- Scheduler park/wake refactor (#13) — existing handoff at `docs/HANDOFF_PRIORITY_SPRINT.md`. Own session.
- Package manager (#21) — 2–3 weeks, an entirely separate domain. Pick it up after Phase 3 if the memory work is fully shipped.
- TGT.3 WASM backend — needs its own research + design sprint.
- Async Mutex/RwLock (#15) — blocked on #13, not on memory.
- Arena-based compiler allocator (#20) — the world-class version of this sprint’s pragmatic fix. Do it after Phase 3 if RSS is still >8 GB after 3a+3b+3c.
## Related open items (file or fix alongside if discovered)

- PSQ-6 — `Vec.size` reads 0 from the I/O poller pthread. Separate bug class from PSQ-4. Own session.
- PSQ-2 — `import progress` cascade breaks `std/quake.qz`. Module-system load order. Own session.
- `send`/`recv` POSIX shadowing — `std/ffi/socket.qz` extern declarations shadow the channel builtins, breaking `concurrency_spec`, `channel_handoff_spec`, and `unbounded_channel_spec`. Pre-existing. Own session.
- B3-HARD-ERROR-FALLBACK — tighten `mir_find_field_globally` to error on ambiguous matches (the fallback that allowed PSQ-4’s silent wrong-struct reads). Defense-in-depth follow-up.
- QZ0603 warning spam — a full self-compile still emits `cannot resolve struct type for field access '.pop'/'.size'/'.free', using offset 0` warnings despite PSQ-4 being closed. These are other flavors of the same issue. Worth auditing after the memory work.