P.5 — String Interning Plan
Status: Design document — no compiler source modifications
Target: Reduce self-compilation time from ~15.6s to <5s via string allocation elimination
Date: Feb 28, 2026
1. Profiling Results
Baseline Measurement
Self-compilation: ~63,794 lines across 48 .qz source files
Wall time: 15.585s
User time: 9.879s
System time: 1.619s (~11.5s total CPU; wall time is ~1.9× the 8.4s ROADMAP record and ~3× the <5s target)
The ROADMAP records 8.4s after the length-prefixed string optimization (P.5-LP). The current measurement of ~15.6s likely reflects additional compiler complexity added since that measurement (intersection types, safety audit, stress tests, derive macros, etc. — the compiler grew from ~1,357 functions to ~1,466+ functions).
Compiler Pipeline Phase Structure
From quartz.qz, the compilation pipeline for a self-compile invocation processes:
| Phase | Description | Bottleneck Hypothesis |
|---|---|---|
| 1 | Lexer — tokenize source file | Moderate: 1 string allocation per token |
| 2 | Parser — build AST | Heavy: 12 parallel vectors per node, str1+str2 allocations |
| 2.4 | Derive — expand @derive annotations | Light |
| 2.5 | Macros — expand user macros | Light |
| 3 | Resolver — import resolution | Heavy: re-lex + re-parse each of ~35+ imported modules |
| 3.5 | Build cache — content hash check | Light (SHA-256 per source file) |
| 4.0a-e | Type registration — structs/enums/traits/impls/consts/globals | Moderate |
| 4.1 | Signature registration — function signatures | Moderate |
| 4.2 | Liveness — NLL pre-pass | Moderate |
| 4.3 | Typecheck — tc_program + tc_function for all modules | Heavy: ~3,167 lines typecheck.qz + ~5,763 lines typecheck_walk.qz |
| 5 | MIR lowering — mir_lower_all | Heavy: ~6,021 lines mir_lower.qz |
| 5.5 | MIR optimization — e-graph optimizer | Moderate |
| 6 | Codegen — emit LLVM IR | Heavy: ~2,070 + ~8,499 lines codegen + codegen_intrinsics |
String Allocation Hotspots
The compiler’s dominant allocation pattern is strings. Every identifier, type name,
function name, and error message is a heap-allocated String (length-prefixed since P.5-LP).
Key allocation sites:
| Source | vec_new<String> calls | .push(string) calls | Per-invocation estimate |
|---|---|---|---|
| lexer.qz | 1 | 28 | ~N tokens × 4 vectors (only the lexemes Vec holds strings) |
| parser.qz | 6 | 105 | ~N nodes × str1/str2 pushes |
| ast.qz | 3 | 927 (total pushes) | 12 pushes per AST node — 2 are String vectors |
| typecheck.qz | 48 | — | Registries: struct names, field names, enum names, etc. |
| mir.qz + mir_lower.qz | 40 | — | Variable bindings, instruction names, string pools |
| codegen.qz | 0 | — | Builds IR strings (StringBuilder) |
| resolver.qz | 7 | — | Module paths, loaded paths, prefixed names |
| quartz.qz | 12 | — | CLI, import paths, cache hashes |
Estimated string allocations per self-compile:
The compiler sources total ~63,794 lines. Estimating:
- ~200,000 tokens → ~200,000 `lexeme` string allocations (one per token)
- ~60,000 AST nodes → ~120,000 `str1`/`str2` string pushes (2 per node)
- TypecheckState registries: ~50 `Vec<String>` vectors, each growing during compilation
- MIR: variable names, instruction labels, string pool entries
Conservative estimate: 400,000–600,000 string allocations during self-compilation.
Among these, a large fraction are duplicate identifier strings. The same identifiers
(Int, String, Vec, Bool, Void, var, def, end, function names, type names)
appear thousands of times and are allocated fresh each time.
Deduplication Potential
Common identifiers across the compiler source:
- Keywords: `def`, `end`, `if`, `else`, `return`, `var`, `while`, `for`, `in` — appear in ~90% of lines
- Built-in types: `Int`, `String`, `Bool`, `Void`, `Vec`, `HashMap` — thousands of occurrences
- Field names: `size`, `push`, `get`, `set` — repeated per node access
- Operator strings: `+`, `-`, `*`, `/`, `==`, `!=` — appear in codegen string building
If 60–70% of string allocations are duplicates (conservative for a compiler that processes its own source), interning would eliminate 250,000–400,000 allocations.
2. String Interner Design
API
```
# A string interner returns integer handles instead of heap-allocated strings.
# The interner owns all string data; handles are just indices.
newtype InternId = Int

struct StringInterner
    strings: Vec<String>          # handle → string (indexed lookup)
    lookup: HashMap<String, Int>  # string → handle (dedup lookup)
end

def intern_new(): StringInterner
    return StringInterner {
        strings: vec_new<String>(),
        lookup: hashmap_new<String, Int>()
    }
end

## Intern a string. Returns existing handle if already interned,
## otherwise stores the string and returns a new handle.
def intern(interner: StringInterner, s: String): InternId
    # Relies on get() returning a negative sentinel for absent keys;
    # handle 0 is a valid InternId, so 0 must not double as "missing".
    var existing = interner.lookup.get(s)
    if existing >= 0
        return existing
    end
    var id = interner.strings.size
    interner.strings.push(s)
    interner.lookup.set(s, id)
    return id
end

## Resolve an interned handle back to its string.
def intern_resolve(interner: StringInterner, id: InternId): String
    return interner.strings[id]
end
```
Storage Model
```
StringInterner
├── strings: Vec<String>          # Dense array: handle → string value
│     [0] = "Int"
│     [1] = "String"
│     [2] = "Bool"
│     [3] = "main"
│     [4] = "def"
│     ...
└── lookup: HashMap<String, Int>  # Hash map: string → handle
      "Int"    → 0
      "String" → 1
      ...
```
Characteristics:
- O(1) amortized intern (HashMap lookup + optional insert)
- O(1) resolve (Vec index)
- Memory: one copy of each unique string + HashMap overhead
- Thread-unsafe (single-threaded compiler — fine)
Integration Points
The interner would be a global singleton passed through the pipeline. Every phase
that currently stores String identifiers would instead store InternId handles.
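One common interner refinement worth considering at this integration point — sketched here with a hypothetical `intern_prefill` helper, not existing code — is pre-interning keywords and built-in type names at startup so they receive stable, well-known handles:

```
# Hypothetical startup helper: pre-intern well-known strings so their
# handles are fixed by call order, independent of source file contents.
def intern_prefill(interner: StringInterner): Void
    intern(interner, "Int")     # handle 0
    intern(interner, "String")  # handle 1
    intern(interner, "Bool")    # handle 2
    intern(interner, "def")     # handle 3
    intern(interner, "end")     # handle 4
end
```

Downstream phases could then compare against known handles directly (no HashMap lookup at all), and the handles for the hottest names would be identical across generations, reducing the order-sensitivity discussed in Section 4.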
3. Migration Strategy
Phase 1: Lexer Integration (Highest Impact)
Goal: Intern all identifier and keyword tokens at lex time.
Current (lexer.qz):
```
LexerResult {
    types: Vec<Int>,      # token types — keep as-is
    lexemes: Vec<String>, # ← each token is a heap-allocated string
    lines: Vec<Int>,      # line numbers — keep as-is
    cols: Vec<Int>        # column numbers — keep as-is
}
```
After:
```
LexerResult {
    types: Vec<Int>,
    lexemes: Vec<InternId>,    # ← intern IDs instead of strings
    lines: Vec<Int>,
    cols: Vec<Int>,
    interner: StringInterner   # ← shared interner
}
```
Files changed: lexer.qz
Impact: Eliminates ~200,000 string allocations; the lexemes vector shrinks from `Vec<String>` to plain-integer `Vec<InternId>` storage
Phase 2: Parser/AST Integration
Goal: Store interned IDs in AST string slots (str1, str2, str3).
Current (ast.qz):
```
struct AstStorage
    ...
    str1s: Vec<String>  # ← primary names (identifiers, type names)
    str2s: Vec<String>  # ← secondary names (type annotations)
    str3s: Vec<String>  # ← tertiary (type params)
    ...
end
```
After:
```
struct AstStorage
    ...
    str1s: Vec<InternId>  # handles into shared interner
    str2s: Vec<InternId>
    str3s: Vec<InternId>
    ...
end
```
Files changed: ast.qz, parser.qz
Impact: Eliminates ~120,000 AST string allocations
Risk: All downstream consumers of ast_get_str1() etc. must be updated to resolve via interner
Phase 3: Typecheck Registry Integration
Goal: Use InternId for type/function name lookups.
Current (typecheck.qz / typecheck_util.qz):
```
struct TcRegistry
    struct_names: Vec<String>
    enum_names: Vec<String>
    func_names: Vec<String>
    ...
end
```
After: Store Vec<InternId> and use integer comparison instead of string comparison
for name lookups. This changes name resolution from O(n × strlen) to O(n × 1).
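The effect on a registry scan can be sketched as a before/after pair — `find_struct_before` assumes the current `Vec<String>` field, `find_struct_after` assumes the migrated `Vec<InternId>` field, and both function names are illustrative:

```
# Before: each comparison walks the string bytes — O(strlen) per entry.
def find_struct_before(reg: TcRegistry, name: String): Int
    var i = 0
    while i < reg.struct_names.size
        if reg.struct_names[i] == name   # byte-by-byte string compare
            return i
        end
        i = i + 1
    end
    return -1
end

# After: handles are integers, so each comparison is one machine word,
# and the caller passes intern(interner, name) computed once up front.
def find_struct_after(reg: TcRegistry, name: InternId): Int
    var i = 0
    while i < reg.struct_names.size
        if reg.struct_names[i] == name   # single integer compare
            return i
        end
        i = i + 1
    end
    return -1
end
```

The linear scan itself is unchanged; only the per-entry cost drops, which is what the O(n × strlen) → O(n × 1) claim refers to.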
Files changed: typecheck.qz, typecheck_util.qz, typecheck_registry.qz, typecheck_builtins.qz, typecheck_walk.qz
Impact: Faster type lookups, reduced memory churn
Phase 4: MIR & Codegen Integration
Goal: Use InternId for variable bindings and function references in MIR.
Files changed: mir.qz, mir_lower.qz, codegen.qz, codegen_util.qz, codegen_intrinsics.qz
Risk: Codegen builds LLVM IR strings using actual string values — this phase needs the interner to resolve IDs back to strings for IR emission. The interner speeds up everything before final IR emission.
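The boundary is the point where an InternId must become text again. A sketch of resolving only at emission time — `emit_call` and the StringBuilder `append` method are illustrative assumptions, not actual codegen.qz API:

```
# Sketch: MIR carries handles; codegen resolves to text only when
# the LLVM IR line is actually written out.
def emit_call(builder: StringBuilder, interner: StringInterner, func: InternId): Void
    var name = intern_resolve(interner, func)  # one O(1) Vec index
    builder.append("call void @")              # hypothetical append API
    builder.append(name)
    builder.append("()")
end
```

Resolution is an index into the dense `strings` vector, so deferring it to emission costs nothing while keeping every earlier phase allocation-free.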
Recommended Order
- Phase 1: Lexer — highest impact, lowest risk, self-contained
- Phase 2: AST — requires updating all `ast_get_str1`/`str2`/`str3` callers
- Phase 3: Typecheck — large surface area but mechanical
- Phase 4: MIR/Codegen — lowest priority, most complex
Each phase should be independently buildable and fixpoint-verifiable.
4. Risk Assessment
Fixpoint Implications
The interner changes the internal representation but not the output. The compiler still emits identical LLVM IR. However:
- Order sensitivity: The interner assigns sequential IDs. If the intern order changes between gen1 and gen2, the compiler's internal state differs, but the output IR should be identical because IDs are always resolved back to strings for codegen.
- Two-stage bootstrap: Like the length-prefixed string migration (P.5-LP), this may require gen0 → gen1 → gen2 → gen3 with gen2==gen3 verification, since gen0 (current compiler) doesn't have the interner but gen1 does.
- HashMap determinism: The interner's HashMap iteration order is a potential determinism hazard. In practice it is harmless here: we use `intern()` for lookup and `intern_resolve()` for output, we never iterate the HashMap, and codegen resolves IDs back to strings, so the output is independent of HashMap order.
Breaking Changes
- `ast_get_str1()` return type changes from `String` to `InternId` (or we keep a wrapper that auto-resolves — safer but slower)
- Every file that calls `ast_get_str1()` etc. needs updating
- The interner must be thread-safe or single-threaded — currently fine since the compiler is single-threaded
Mitigation
- Incremental migration: Use a wrapper `ast_get_str1_resolved(store, handle, interner)` that resolves automatically. Gradually migrate callers to use raw `InternId`.
- One phase at a time: Each phase should build + test + fixpoint before proceeding.
- Escape hatch: If a phase breaks, revert and use the previous binary.
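The incremental-migration wrapper could be as small as this (a sketch; direct `str1s` field access is illustrative, the real accessor may differ):

```
# Transitional wrapper: callers keep receiving a String while they migrate.
# node indexes the parallel AST vectors; str1s now holds InternId handles.
def ast_get_str1_resolved(store: AstStorage, node: Int, interner: StringInterner): String
    return intern_resolve(interner, store.str1s[node])
end
```

Hot paths (typecheck lookups, MIR lowering) would migrate to raw `InternId` comparisons, leaving the wrapper mainly for diagnostics and error messages where text is needed anyway.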
5. Estimated Impact
| Optimization | Est. Allocations Eliminated | Est. Time Reduction |
|---|---|---|
| Lexer interning | ~200,000 | 2–3s |
| AST interning | ~120,000 | 1–2s |
| Typecheck interning | ~50,000 | 0.5–1s |
| MIR interning | ~30,000 | 0.3–0.5s |
| Total | ~400,000 | 4–7s |
If self-compilation drops from ~15.6s to ~8–11s, that’s a meaningful improvement. Combined with other optimizations (arena allocation for AST nodes, Vec preallocation), the <5s target may be achievable.
Additional Optimizations (Not In Scope)
These complement interning but are separate work items:
- Arena AST allocation: Instead of 12 parallel Vecs with individual pushes, use one contiguous arena with fixed-size AST node records.
- StringBuilder pooling: Reuse StringBuilder instances instead of creating new ones.
- Vec preallocation: `vec_new_with_capacity<String>(estimated_tokens)` avoids resizing.
- Lazy module loading: Don't re-lex/re-parse unchanged modules (this is P.2 incremental compilation).
- Parallel module compilation: Lex/parse modules in parallel (blocked by single-threaded interner).
6. Implementation Checklist
- Create `self-hosted/shared/string_intern.qz` with `StringInterner` struct
- Add QSpec tests for interner (`spec/qspec/string_intern_spec.qz`)
- Phase 1: Integrate into lexer.qz, update LexerResult
- Phase 1: Update all LexerResult consumers (parser.qz)
- Build + QSpec + Fixpoint after Phase 1
- Phase 2: Change AstStorage str1s/str2s/str3s to `Vec<InternId>`
- Phase 2: Update all ast_get_str1/str2/str3 callers (or add resolved wrappers)
- Build + QSpec + Fixpoint after Phase 2
- Phase 3: Change TcRegistry string vectors to InternId
- Build + QSpec + Fixpoint after Phase 3
- Phase 4: Change MIR bindings to InternId
- Build + QSpec + Fixpoint after Phase 4
- Final profiling measurement — compare against baseline