# Next Session — Compiler Memory Optimization Phase 3

**Baseline:** ca2f64fc (PSQ-8 closed, 2288 functions, fixpoint verified, smoke 4/4 + 22/22 green)
**Target:** macOS self-compile 25 GB → <8 GB peak RSS (3.1× reduction, working target)
**Scope:** Multi-session — three sub-phases, each ~0.5–1.5 quartz sessions
**Prime directive:** #1 (highest impact) — this is a foundational developer-experience win that makes self-compile viable on 16 GB machines again
## Why this is the next big chunk

The roadmap has four candidates for the “next big chunk”:
| Option | Effort | Impact | Why not |
|---|---|---|---|
| Scheduler park/wake refactor (#13) | 2–3d | Closes 4 specs + enables #15 async locks | Existing handoff in docs/HANDOFF_PRIORITY_SPRINT.md; smaller chunk |
| Package Manager (#21) | 2–3w | “THE post-launch priority” | Big domain shift away from compiler work; user’s recent momentum is compiler polish |
| TGT.3 Direct WASM Backend | 10 phases | Canvas Site demo, browser target | Requires upfront research sprint; higher risk of wasted effort without a plan |
| Compiler Memory Phase 3 | 1–2w | 25 GB → <8 GB on macOS, 2–3× faster self-compile | — this is the pick |
Why Phase 3:

- **Real problem, real data.** Verified Apr 15, 2026 (after PSQ-8 landed): `quartz --memory-stats` on a full self-compile shows 25,229 MB peak RSS on macOS. The Phase 2 number of 12.7 GB was from the mimalloc-linked build, but mimalloc was disabled on macOS in Batch A (Apr 14) due to the dyld TLS-slot race (see `docs/PLATFORMS.md`). The user runs Apple Silicon. Every self-compile is 35 seconds of swap-thrashing on a 16 GB machine.
- **Concrete phase-by-phase breakdown already in hand.** Running `--memory-stats` on trunk HEAD:

  ```
  [mem] init:              1 MB current (0ms)
  [mem] lex:               3 MB (+2) (1ms)
  [mem] parse:            42 MB (+39) (13ms)
  [mem] resolve_pass1: 10462 MB, 82 modules
  [mem] resolve_pass2: 10495 MB
  [mem] resolve:       10498 MB (+10456) (2608ms)
  [mem] typecheck:     17028 MB (+6530) (11521ms)
  [mem] tc_free:       17028 MB (+0) (11521ms)
  [mem] mir:           24485 MB (+7457) (32342ms)
  [mem] codegen:       25229 MB (+744) (35579ms)
  ```

  The three big spenders: resolve (+10.4 GB, largest), mir (+7.5 GB, second), typecheck (+6.5 GB, third). Codegen is already cheap (+744 MB). `tc_free` is a no-op despite its name — not actually freeing anything.
- **Single-file experiment confirms the parser hot spot.** Compiling `self-hosted/frontend/lexer.qz` alone (1799 LOC, 4 modules imported) burns 1977 MB in the parse phase. That’s 1.1 MB per source line, which is pathological. Something in the parser (or the resolver, for the imports) grows O(n²) or worse.
- **Already-calibrated gains from Phase 1+2.** Phase 1 was `tc_mangle` intmap caching (eliminated 12M allocations in typecheck). Phase 2 added 33 more cache sites across MIR and resolver. The same pattern should work for parse/resolve, plus we have the phase-based freeing lever that Phase 1+2 didn’t touch.
- **Unblocks everything.** A 3× faster self-compile (35s → ~12s) is a 3× speedup on every compiler iteration. Over a 2-week sprint on any downstream feature (WASM backend, package manager, async locks), that’s hours of saved wall time. This is the highest-leverage quality-of-life improvement available.
## Phase 3a: Parser / resolve memory hot spot (~0.5–1 session)

**Target:** Reduce the resolve phase from 10,498 MB to ~2,000 MB. Reduce the single-file `lexer.qz` parse from 1977 MB to <200 MB.
### What we know

- Lex itself is cheap: 3 MB for all 82 self-hosted modules (the lexer produces token streams, not strings).
- Parse on a single file (`lexer.qz`, 1799 LOC) burns 1.9 GB. The single-file parse phase is the culprit, not module loading.
- Per-LOC cost: ~1.1 MB. Roadmap item #19 specifically flags this: “lexer.qz uses 2 GB for parsing alone (300 if-blocks in while loop)”. Note: the “300 if-blocks” hint is from project memory (6 days old) — verify against current `frontend/lexer.qz` and `frontend/parser.qz`.
- Resolve then scales that up to 10.5 GB across 82 modules. Resolve’s pass1 reads 10,462 MB; pass2 adds only 33 MB. So pass1 is the spender.
### What to investigate first

1. **Profile the parser on `lexer.qz` in isolation.** The memory hotspot is almost certainly an O(n²) loop in the parser accumulating state. Candidates ranked by likelihood:
   - String concatenation in token buffering — if each token appends to a growing string with `s + t`, that’s O(total bytes²).
   - Vec push with implicit reallocation — if Vec growth doubles without releasing old storage (unlikely given the `vec.c` implementation, but verify).
   - AST node slot-value strings — if every AST node allocates a fresh string for `str1`/`str2` instead of reusing interned symbols, multiply by ~30k nodes = 30k strings per module.
   - Nested if-elif chains building long arms — each arm’s AST potentially heap-allocated separately.
2. **Instrument the `ps_*` functions with allocation counters.** Before optimizing, measure. Add a cheap counter (static int) to each parse entry point, print on exit, and correlate with memory growth. Or just sprinkle `eputs("{pos=#{pos}, rss=#{mem_peak_rss() / 1048576} MB}")` at each major loop iteration in the suspected hot spot.
3. **Verify what resolve pass1 is doing.** Pass1 grows 10 GB across 82 modules — check whether it’s re-parsing modules or doing something expensive once per module. Look at `self-hosted/resolver.qz` for the loop structure. If it’s 2 GB per small module × 82 modules, the fix is parse-level. If it’s 100 MB per module × 82 with a long tail, it’s a different bug.
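To make the first candidate concrete, here is a small cost model in C (not Quartz; the function names are illustrative) showing why per-token `s + t` appending copies a quadratic number of bytes while an amortized doubling buffer stays linear:

```c
#include <assert.h>
#include <stddef.h>

/* Cost model only -- hypothetical names, not Quartz code.
 * Pattern 1: each token append copies the whole accumulated string
 * (the `s + t` suspect), so n one-byte appends copy O(n^2) bytes. */
size_t concat_copy_cost(size_t n_tokens) {
    size_t copied = 0;
    for (size_t len = 0; len < n_tokens; len++)
        copied += len + 1;            /* re-copy existing bytes + the new one */
    return copied;
}

/* Pattern 2: an amortized doubling buffer re-copies only on growth,
 * so total copying stays O(n). */
size_t buffer_copy_cost(size_t n_tokens) {
    size_t copied = 0, cap = 1;
    for (size_t len = 0; len < n_tokens; len++) {
        if (len == cap) { copied += len; cap *= 2; }  /* realloc-style copy */
        copied += 1;                                   /* append one byte */
    }
    return copied;
}
```

At 1000 tokens the quadratic pattern already copies ~250× more bytes than the buffer; over a whole source file, and with each intermediate string left live, that class of pattern could plausibly reach the observed ~2 GB.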
### Fix shape

Depends on what the profile reveals. Most likely one of:

- **Symbol interning at the parser level.** If every `IDENT` token allocates a new string, switch to a `Map<String, Int>` symbol table keyed by byte slice — return the existing ID on hit. rustc does this with `Symbol`; Zig does this with its intern pool.
- **String slicing instead of copying.** If the parser copies substrings out of the source buffer, switch to passing `(offset, length)` tuples. The source buffer outlives parse; we don’t need copies.
- **AST node pooling.** If 30k AST nodes × 16 slot fields × ~100 bytes per allocation = 48 MB per module × 82 = 4 GB, switch to slab allocation for AST nodes. rustc’s typed arenas are the reference.
- **A specific O(n²) loop.** If the profile fingers one function, fix it in place.
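As a reference for the interning option, a minimal sketch in C (the names `intern` and `g_syms` are assumptions for illustration; a real table would hash rather than linearly scan): repeated identifiers cost one lookup and zero allocations after the first sighting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interning sketch, not Quartz's actual API: every IDENT is
 * looked up once and represented by a u32 id afterwards. */
#define MAX_SYMS 4096
static char *g_syms[MAX_SYMS];
static uint32_t g_sym_count = 0;

uint32_t intern(const char *start, size_t len) {
    for (uint32_t i = 0; i < g_sym_count; i++) {  /* linear scan for brevity */
        if (strlen(g_syms[i]) == len &&
            memcmp(g_syms[i], start, len) == 0)
            return i;                             /* hit: no allocation */
    }
    assert(g_sym_count < MAX_SYMS);
    char *copy = malloc(len + 1);                 /* miss: copy exactly once */
    memcpy(copy, start, len);
    copy[len] = '\0';
    g_syms[g_sym_count] = copy;
    return g_sym_count++;
}
```

The `(start, len)` signature also composes with the string-slicing option: the parser can intern directly from `(offset, length)` into the source buffer without an intermediate copy.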
### Measurement

Before:

```
./self-hosted/bin/quartz --no-cache --memory-stats ... self-hosted/frontend/lexer.qz > /dev/null
```

- Expected before: `[mem] parse: 1977 MB (+1974)`
- Target after: `[mem] parse: <200 MB`
- Full self-compile before: 25,229 MB peak
- Target after 3a alone: ~15,000 MB peak (saving ~10 GB from the resolve phase by propagating the parse-level gain across 82 modules)
### Files to read

- `self-hosted/frontend/parser.qz` (7521 LOC)
- `self-hosted/frontend/lexer.qz` (1799 LOC)
- `self-hosted/frontend/ast.qz` (AST storage layout — slot allocation patterns)
- `self-hosted/resolver.qz` (resolve pass1 structure)
- `self-hosted/shared/string_intern.qz` (existing intern machinery we might extend)
### Risks

- Parser changes are high-blast-radius. Every compiler iteration touches parse. Run `quake guard` and `quake smoke` after every change, and the full `vec_element_type_spec` / `builtin_arity_spec` / `expand_node_audit_spec` sweep after each experiment.
- Interning changes can break hash stability or error-message source locations. Be careful about what you replace.
## Phase 3b: Phase-based AST/MIR freeing (~0.5–1 session)

**Target:** Reduce the mir phase peak from 24,485 MB to ~14,000 MB. Reduce the codegen peak from 25,229 MB to ~10,000 MB.
### The leverage
- After MIR is lowered, the AST is dead. Codegen reads MIR, not AST. We hold the AST alive through codegen for no reason.
- After codegen is done, MIR is dead. Nothing downstream reads it.
The phase-by-phase growth shows the AST is ~7.5 GB (the MIR delta) and the MIR is probably most of the codegen baseline. Dropping the AST after MIR lowering is worth ~7 GB peak reduction. Dropping MIR after codegen saves less (codegen is the last phase) but would matter if we ever add a post-codegen phase.
Of these two, freeing the AST after MIR is the clear win. It’s plausibly the single biggest memory lever in the entire compiler.
### Implementation shape

1. **Locate `AstStorage` ownership.** Probably one global or a field on `QuartzCompiler`. See the `self-hosted/quartz.qz` top-level pipeline (the driver — `main()` walks the phases) and `self-hosted/frontend/ast.qz` for the storage struct.
2. **Confirm MIR doesn’t back-reference the AST.** Grep for `ast_storage` and `ast_get_` calls inside `self-hosted/backend/`. After MIR lowering, these should all be zero. If any codegen function still calls `ast_get_str1()` on an AST handle, we need to finish the MIR lowering work before we can free — or lift those calls out.
3. **Free the AST at the phase boundary.** After `mir_lower::lower_program(...)` returns, call `ast_free(ast_storage)` (or equivalent — add the helper if it doesn’t exist). The AST storage is a giant vector of slots; freeing is just releasing the underlying bufs.
4. **Verify the memory-stat intrinsic reports the drop.** The stats output should show a new `mir_free` line (or similar) with a negative delta of several GB.
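The steps above reduce to one ownership pattern, sketched here in C under the assumption that MIR copies everything it needs out of the AST (the names `AstStorage`, `lower_program`, and `ast_free` mirror the plan’s vocabulary but are not Quartz’s real API):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative phase-boundary sketch: the driver owns the AST storage,
 * lowers it to MIR, then frees the AST before codegen ever runs. */
typedef struct { int *slots; size_t count; } AstStorage;
typedef struct { int *insts; size_t count; } Mir;

Mir lower_program(const AstStorage *ast) {
    Mir mir = { malloc(ast->count * sizeof(int)), ast->count };
    for (size_t i = 0; i < ast->count; i++)
        mir.insts[i] = ast->slots[i] * 2;  /* copies out: MIR keeps no AST pointers */
    return mir;
}

void ast_free(AstStorage *ast) {
    free(ast->slots);                      /* release the underlying buffer */
    ast->slots = NULL;
    ast->count = 0;
}
```

The safety of the `ast_free` call rests entirely on `lower_program` copying rather than aliasing, which is exactly why step 2 (grep for back-references) must precede step 3.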
### The tc_free mystery

The current memory-stats output shows:

```
[mem] typecheck: 17028 MB (+6530)
[mem] tc_free:   17028 MB (+0)    ← zero delta — nothing freed
```

`tc_free` is supposed to free typecheck state after typecheck completes, but it isn’t actually releasing any memory. Investigate: is `tc_free` actually running? Is it freeing something that was already freed? Is the 6.5 GB that typecheck added permanently reachable? This might be an independent leak worth fixing alongside.
### Measurement

Expected after 3b:

```
[mem] mir:     ~15000 MB (+7500, then -7500 for the AST free) or similar
[mem] codegen: ~15744 MB
```
### Files to read

- `self-hosted/quartz.qz` (compiler driver — phase sequencing)
- `self-hosted/frontend/ast.qz` (AST storage struct + free helper, if any)
- `self-hosted/middle/typecheck.qz` (search for `tc_free` to understand the current no-op behavior)
- `self-hosted/backend/mir_lower.qz` (entry point — where to add the `ast_free` call after it returns)
### Risks

- If MIR still holds AST pointers, freeing the AST causes use-after-free. This is why step 2 (grep for back-references) is mandatory before step 3.
- `tc_free` might be a no-op because the state is still referenced. Fixing `tc_free` requires tracing what holds onto `TypecheckState` after typecheck completes.
## Phase 3c: @cfg gating unblock (~0.5 session)

**Target:** Enable `@cfg(feature: "...")` gating on def-level items (not just import-level). Use it to gate egraph (~457 MB), lint (~103 MB), and any other dev-only modules out of a minimal self-compile.
### The blocker

Project memory (6 days old, verify):

> “More @cfg gating — egraph (457 MB), lint (103 MB) — blocked by @cfg-on-def SIGSEGV bug”

Somewhere, putting `@cfg(feature: "xxx")` above a def causes a SIGSEGV at parse, resolve, or typecheck time. The bug is the blocker; fix it first, then the gating pays off.
### Investigate

1. **Reproduce the SIGSEGV.** Write a minimal file:

   ```
   @cfg(feature: "xyz")
   def unused(): Int = 0

   def main(): Int = 0
   ```

   Compile with `quartz --no-cache /tmp/cfg.qz`. If it crashes, we have a minimal repro.
2. **Find where `@cfg` is processed.** Grep `parser.qz` and `resolver.qz` for `cfg` — probably an attribute handler. The handler likely assumes `@cfg` only appears on imports, not on defs.
3. **Fix the handler to support def-level gating.** Skip or elide the def when the feature is absent. This might require threading `g_cfg_features` through to the parser’s def handler.
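Once the handler is reachable from the def path, the gating decision itself is trivial. A hedged sketch in C of the intended semantics (the names `should_emit_def` and the enabled-feature list are assumptions, not the real handler):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative def-level gating: a def carrying @cfg(feature: "...") is
 * emitted only when that feature is in the enabled set. The resolver must
 * skip the def entirely (no dangling symbol), which is presumably the part
 * the current import-only handler gets wrong for defs. */
bool feature_enabled(const char *name, const char **enabled, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(enabled[i], name) == 0) return true;
    return false;
}

/* cfg_feature == NULL means the def has no @cfg attribute: always emit. */
bool should_emit_def(const char *cfg_feature, const char **enabled, size_t n) {
    return cfg_feature == NULL || feature_enabled(cfg_feature, enabled, n);
}
```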
### Payoff

Once fixed, gate the dev-only modules:

```
@cfg(feature: "egraph")
def egraph_optimize(...)
```

Run the `--memory-stats` pipeline with and without the feature flag to measure the savings. Expected: ~500–600 MB off the peak, shaved from MIR/typecheck.
### Risks

- `@cfg` parsing bugs can produce confusing errors. Start with a single def, not a large gate.
- Gating in the wrong place can break tests that depend on the gated module. Run the QSpec sweep after each gate addition.
## Session sequencing

**Session 1 — Phase 3a (parser hot spot)**

- Profile the `lexer.qz` single-file parse
- Identify the O(n²) culprit
- Fix in place
- Measure: single-file parse <200 MB, full self-compile <15 GB peak
- Commit with fixpoint verified

**Session 2 — Phase 3b (phase-based freeing)**

- Investigate `tc_free` (why is it a no-op?)
- Grep MIR/codegen for AST back-references
- Add the `ast_free` call after MIR lowering
- Measure: codegen phase <10 GB peak
- Commit with fixpoint verified

**Session 3 — Phase 3c (@cfg unblock + gating)**

- Minimal SIGSEGV repro
- Fix the parser/resolver handler
- Add `@cfg(feature: "egraph")` gates
- Measure: feature-off build shaves ~500 MB
- Commit with fixpoint verified
Total: ~3 quartz sessions (calibrated from 1–2 weeks traditional per roadmap estimate).
Sessions can run independently. 3a → 3b → 3c is the natural order but not required — 3c could slot in at any point.
## Research pointers (Directive 2 — design the full correct solution)

How do the big systems languages handle this problem?
- **rustc:** Arena allocation per compilation unit, with typed arenas for different node kinds; each phase drops its own arena. Symbol interning via `Symbol` — every `IDENT` is a `u32` index into a global table. `ThinVec<T>` for small collections to avoid header overhead. See rustc’s `arena.rs` and `symbol.rs`.
- **Go compiler:** Phase-based AST drops. The `types2` package explicitly releases `*Info` after typechecking. Go’s `parser.File` owns everything and is drop-once.
- **Zig:** Per-function arena. `Ast.zig` holds all source-level info in a flat array of `Node`; it is lowered to `Zir`, after which the AST can be freed. Symbol interning via `InternPool`.
- **LLVM/clang:** `BumpPtrAllocator` for the AST, dropped between TUs. Clang’s `ASTContext` owns everything; there’s no phase freeing within a TU, but TU boundaries are hard drops.
The common pattern: arena per phase, drop at phase boundary, intern strings once.
Quartz has none of this today. The AST lives in one big AstStorage vector that never frees. Strings are malloc’d one-by-one. There’s no symbol table beyond the typecheck builtin map.
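For reference, the arena-per-phase pattern those compilers share is small enough to sketch in C (illustrative only; Quartz would build something like this over `AstStorage`):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal bump arena: all nodes for a phase come from one buffer, and the
 * whole phase is released with a single free at the phase boundary. */
typedef struct { char *base; size_t cap, used; } Arena;

Arena arena_new(size_t cap) {
    Arena a = { malloc(cap), cap, 0 };
    return a;
}

void *arena_alloc(Arena *a, size_t size) {
    size = (size + 7) & ~(size_t)7;        /* round up to 8-byte alignment */
    if (a->used + size > a->cap) return NULL;
    void *p = a->base + a->used;           /* bump the cursor, no per-node header */
    a->used += size;
    return p;
}

void arena_drop(Arena *a) {                /* phase boundary: one free, O(1) */
    free(a->base);
    a->base = NULL;
    a->used = a->cap = 0;
}
```

The point of the design: allocation is a pointer bump, deallocation is one call per phase, and there is no per-node malloc header overhead, which is where the “16 slot fields × ~100 bytes” estimate above mostly goes.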
The world-class fix (not necessarily this sprint, but worth designing for):

- Slab-allocated AST nodes in a single arena buf.
- String interning for all identifiers and literals (shared backing store).
- Phase boundaries that drop the arena.

The pragmatic fix (this sprint):

- Find and fix the parse-phase O(n²) (Phase 3a).
- Add an `ast_free()` helper that walks the `AstStorage` vecs and releases them (Phase 3b).
- Fix `@cfg` and gate the dev modules (Phase 3c).
The pragmatic fix buys 60–70% of the win without a rewrite. The world-class fix is a future sprint after launch.
## Session-start checklist

```sh
cd /Users/mathisto/projects/quartz

# 1. Verify baseline
git log --oneline -6                          # should show ca2f64fc PSQ-8 at top
git status                                    # should be clean
./self-hosted/bin/quake guard:check           # "Fixpoint stamp valid"
./self-hosted/bin/quake smoke 2>&1 | tail -6  # 4/4 + 22/22

# 2. Read this document + the memory-opt project memory
cat docs/handoff/next-session-compiler-memory-phase3.md
cat /Users/mathisto/.claude/projects/-Users-mathisto-projects-quartz/memory/project_memory_optimization.md

# 3. Establish the baseline memory measurement
./self-hosted/bin/quartz --no-cache --memory-stats \
  -I self-hosted/frontend -I self-hosted/middle -I self-hosted/backend \
  -I self-hosted/shared -I std -I tools \
  self-hosted/quartz.qz > /dev/null
# Expected (from Apr 15 verification):
#   [mem] parse:     42 MB
#   [mem] resolve:   10498 MB (+10456)
#   [mem] typecheck: 17028 MB (+6530)
#   [mem] mir:       24485 MB (+7457)
#   [mem] codegen:   25229 MB (+744)

# 4. Single-file baseline (the parser hot spot)
./self-hosted/bin/quartz --no-cache --memory-stats \
  -I self-hosted/frontend -I self-hosted/middle -I self-hosted/backend \
  -I self-hosted/shared -I std -I tools \
  self-hosted/frontend/lexer.qz > /dev/null
# Expected:
#   [mem] parse: 1977 MB (+1974)   <- THE PARSER HOT SPOT

# 5. Fix-specific backup before touching anything
cp self-hosted/bin/quartz self-hosted/bin/backups/quartz-pre-mem3-golden
```
## Success criteria

**Minimum viable (ship-worthy):**

- Full self-compile peak RSS ≤ 15 GB on macOS (was 25 GB)
- `lexer.qz` single-file parse ≤ 500 MB (was 1977 MB)
- All smoke tests green
- All vec/struct/field/concurrency regression specs green
- Fixpoint verified at each commit

**Target (aggressive but plausible):**

- Full self-compile peak RSS ≤ 10 GB on macOS (60% reduction)
- `lexer.qz` single-file parse ≤ 200 MB (90% reduction)
- `--memory-stats` shows the AST dropped after MIR lowering
- `@cfg` gating works and is used

**Stretch (world-class):**

- ≤ 5 GB peak RSS via arena + interning (multi-sprint)
Each committed change must:

- Have the `quake guard` fixpoint verified (2288 ± 30 functions)
- Pass `quake smoke`
- Pass the vec/struct/field regression sweep (18+ specs)
- Pass the concurrency sweep (4+ specs)
- Include a before/after `--memory-stats` snapshot in the commit message body
## Prime directives reminder (v2, Apr 12 2026)

- **Highest impact, not easiest.** 25 GB → <10 GB is a 2.5× DX improvement. Every subsequent compiler iteration pays the dividend. Don’t settle for 20 GB just because the easy fix is a parser one-liner.
- **Research first.** Read up on rustc’s arenas, Zig’s arena, and Go’s phase drops. Don’t reinvent intern tables.
- **Pragmatism ≠ cowardice.** The pragmatic Phase 3 plan above is fine — we’re sequencing correctly. The “world-class” rewrite is a future sprint, not a shortcut.
- **Work spans sessions.** Don’t cram all three sub-phases into one session. Commit at phase boundaries. Hand off cleanly.
- **Report reality.** If the parser fix only gets 30% savings instead of 90%, report 30%. Don’t weasel.
- **Holes get filled or filed.** Phase 3 will surface other hotspots (`tc_free` is a no-op; resolve pass1 growth is not fully explained). File each one as its own row; don’t silent-discover.
- **Delete freely.** If the parse-phase fix makes the intmap cache layer redundant, delete the cache. If the `@cfg` handler has dead branches, delete them.
- **Binary discipline.** `quake guard` + a fix-specific backup before every compiler edit. Fix-specific names: `quartz-pre-mem3a-golden`, `quartz-pre-mem3b-golden`, `quartz-pre-mem3c-golden`.
- **Estimate honestly.** Traditional 1–2 weeks ÷ 4 = 2–5 quartz days. Sessions 1/2/3 should each be a day or less if we’re calibrated.
- **Corrections are calibration.** If the parser O(n²) theory turns out to be wrong (maybe it’s allocator overhead, not algorithmic), update the plan and move on.
## Out of scope for this sprint

- Scheduler park/wake refactor (#13) — existing handoff at `docs/HANDOFF_PRIORITY_SPRINT.md`. Own session.
- Package manager (#21) — 2–3 weeks, an entirely separate domain. Pick it up after Phase 3 if the memory work is fully shipped.
- TGT.3 WASM backend — needs its own research + design sprint.
- Async Mutex/RwLock (#15) — blocked on #13, not on memory.
- Arena-based compiler allocator (#20) — the world-class version of this sprint’s pragmatic fix. Do it after Phase 3 if RSS is still >8 GB after 3a+3b+3c.
## Related open items (file or fix alongside if discovered)

- PSQ-6 — `Vec.size` reads 0 from the I/O poller pthread. Separate bug class from PSQ-4. Own session.
- PSQ-2 — `import progress` cascade breaks `std/quake.qz`. Module-system load order. Own session.
- `send`/`recv` POSIX shadowing — `std/ffi/socket.qz` extern declarations shadow the channel builtins, breaking `concurrency_spec`, `channel_handoff_spec`, and `unbounded_channel_spec`. Pre-existing. Own session.
- B3-HARD-ERROR-FALLBACK — tighten `mir_find_field_globally` to error on ambiguous matches (the fallback that allowed PSQ-4’s silent wrong-struct reads). Defense-in-depth follow-up.
- QZ0603 warning spam — a full self-compile still emits `cannot resolve struct type for field access '.pop'/'.size'/'.free', using offset 0` warnings despite PSQ-4 being closed. These are other flavors of the same issue. Worth auditing after the memory work.