MicroGPT — From 1,092 Lines to 677

· 5 min read
MicroGPT: Part 3 of 3

Most of the gaps microgpt found turned out to be shallow. The compiler already had backend support for numeric conversions and math intrinsics — they just needed stdlib wrappers. Once those landed, the workarounds came out one commit at a time.

The refactoring sequence

Each commit removed one category of workaround. The order followed the dependency chain — you can’t remove hand-rolled math until the stdlib has math, and you can’t simplify the PRNG until you have .to_f64().

.to_f64() intrinsic. The compiler emits sitofp now. The 15-line int_to_f64_fast function and all its call sites disappeared. Every place that kept parallel integer and float counters collapsed to a single integer with .to_f64() where needed.

math.* stdlib. exp, log, pow, sqrt, floor, abs, max — thin wrappers around LLVM intrinsics. 97 lines of Taylor series deleted. The training loss now matches Python to 4+ significant figures because both sides now use hardware-backed exp and log (llvm.exp.f64, llvm.log.f64) instead of my approximations.
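To see why the hand-rolled series cost accuracy, here is a rough Python sketch. The original Hew code isn't reproduced here; this truncated Taylor series is just illustrative of the general failure mode — a fixed-term expansion around 0 degrades as the argument moves away from 0, while the hardware-backed exp stays accurate everywhere:

```python
import math

def taylor_exp(x, terms=12):
    # Truncated Taylor series for e^x around 0 (illustrative, not the
    # deleted Hew implementation): sum of x^n / n! for n < terms.
    total, term = 1.0, 1.0
    for n in range(1, terms):
        term *= x / n
        total += term
    return total

# Near zero the series is excellent...
print(abs(taylor_exp(0.1) - math.exp(0.1)))   # negligible error

# ...but for a moderately large negative argument (common in softmax
# logit differences) the truncation error dominates the tiny true value.
print(abs(taylor_exp(-3.7) - math.exp(-3.7)))
```

Cancellation between the large alternating terms at x = -3.7 leaves an absolute error on the order of the result itself, which is exactly the kind of drift that kept the training loss from matching Python.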

random.* stdlib. The 230-line hand-rolled Mersenne Twister became random.seed(42), random.gauss(0.0, std), random.shuffle(doc_order), and random.choices(cum_weights, cum_total, vocab_size). The stdlib implementation matches CPython’s MT19937 exactly, so the output stayed byte-identical.
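Byte-identical output is only possible because MT19937 is fully deterministic for a given seed, on both sides. In Python terms (the 0.02 standard deviation here is an illustrative placeholder, not a value from the post):

```python
import random

# CPython's generator is MT19937: same seed, same stream of draws,
# on every platform. This is what the Hew stdlib reproduces.
random.seed(42)
a = [random.gauss(0.0, 0.02) for _ in range(3)]

random.seed(42)
b = [random.gauss(0.0, 0.02) for _ in range(3)]

print(a == b)  # identical draws from identical seeds
```

Any deviation in the generator's state transitions — even one bit — would desynchronize every downstream weight initialization and sample, so "matches CPython exactly" is the whole ballgame.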

break/continue and unary negation. All fourteen instances of 0.0 - x became -x. The done-flag while-loop patterns became clean while loops with break. The tokenizer sort that had the compound corruption bug became straightforward:

while j2 >= 0 {
    let a_code = string_char_at(uchars[j2], 0);
    if a_code > key_code {
        uchars[j2 + 1] = "" + uchars[j2];
        j2 -= 1;
    } else {
        break;
    }
}

impl Tape and HashMap tokenizer. Methods moved onto the Tape type with impl Tape { ... }, so vadd(t, a, b) became t.vadd(a, b). The character-to-token lookup switched from linear search over Vec<String> to HashMap<String, Int>. Compound assignment operators (+=, -=) replaced verbose x = x + 1 patterns.
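The HashMap switch is the usual linear-scan-versus-hash-probe trade. A Python sketch of the two lookup shapes (the vocabulary here is a made-up placeholder, not the real character set):

```python
# Hypothetical character vocabulary, index = token ID.
vocab = ["\n", "a", "b", "c"]

# Before: linear search over the Vec<String> equivalent — O(vocab)
# per character encoded.
def encode_linear(ch):
    for i, v in enumerate(vocab):
        if v == ch:
            return i
    return -1

# After: HashMap<String, Int> equivalent — one hash probe per character.
char_to_id = {ch: i for i, ch in enumerate(vocab)}

print(encode_linear("c"), char_to_id["c"])  # same ID either way
```

For a character-level tokenizer the vocabulary is small, so this is more about idiom than speed — but it removes an inner loop from every encode call.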

for i in 0..n and const. Hyperparameters became const declarations. Most indexed while loops became for loops. Redundant length parameters dropped from function signatures — v.len() replaced a separately-tracked n.

The final state

677 lines. The Model struct holds all nine weight matrices:

type Model {
    wte: Vec<Val>;      // token embeddings
    wpe: Vec<Val>;      // position embeddings
    lm_head: Vec<Val>;  // output projection
    attn_wq: Vec<Val>;  // query weights
    attn_wk: Vec<Val>;  // key weights
    attn_wv: Vec<Val>;  // value weights
    attn_wo: Vec<Val>;  // output weights
    mlp_fc1: Vec<Val>;  // MLP first layer
    mlp_fc2: Vec<Val>;  // MLP second layer
}

The struct feeds into gpt_forward, which takes a model, a token ID, a position, and the KV cache, and returns logits. The training loop creates a fresh tape each step — copy parameter values, run forward, compute cross-entropy loss, backpropagate, apply Adam. Inference copies the trained parameters into a fresh tape and samples autoregressively with temperature scaling.
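The per-step optimizer update can be sketched in Python. This is a generic Adam step over flat parameter lists, mirroring the shape of the loop described above; the hyperparameters and the toy objective are illustrative defaults, not microgpt's actual values:

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam: biased first/second moment estimates, bias
    # correction by step count t, then the scaled update.
    for i in range(len(params)):
        m[i] = b1 * m[i] + (1 - b1) * grads[i]
        v[i] = b2 * v[i] + (1 - b2) * grads[i] * grads[i]
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Toy usage: minimize f(p) = p^2, standing in for the real
# forward/backward pass that produces the gradients.
params, m, v = [1.0], [0.0], [0.0]
for t in range(1, 201):
    grads = [2.0 * params[0]]  # d/dp of p^2
    adam_step(params, grads, m, v, t)
print(params[0])  # has moved toward the minimum at 0
```

The fresh-tape-per-step design means only the parameter values and the Adam moment buffers persist across steps; everything else is rebuilt.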

Python compatibility

With seed 42 on the same names dataset, the first six generated names are byte-identical to Python’s output:

sample 1: jamin
sample 2: aorti
sample 3: jadin
sample 4: lira
sample 5: dede
sample 6: ede

Training losses agree to four decimal places across all 200 steps. The tiny drift after sample 6 comes from hardware math — llvm.exp.f64 and Python's math.exp can disagree in the last bit, and those one-ulp differences accumulate over thousands of operations. IEEE 754 doesn't require transcendental functions to be correctly rounded, so both results are conforming; they just round the final bit differently.

What’s still there

The 677-line file is 5.6x the 120-line Python original. Some of that expansion is inherent — Hew is statically typed, doesn’t have operator overloading, and requires explicit allocation. tape.vadd(a, b) will always be more verbose than a + b. The Tape struct with seven parallel arrays will always be more explicit than Python’s __add__ magic.

But some of it is still language gaps. Vec<Vec<T>> still doesn’t compile, so matrices remain flat with manual indexing. HashMap<String, Vec<Val>> hasn’t been tested, so weight matrices are named fields on a struct. And the tokenizer still does "" + uchars[j2] to deep-copy strings during sorting — Vec<String> element access semantics haven’t been clarified yet.
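The flat-matrix workaround looks like this in Python terms (dimensions here are made up for illustration — the real matrices are the model's embedding and attention weights):

```python
# Row-major flat storage standing in for a Vec<Val> matrix, since
# Vec<Vec<T>> doesn't compile yet.
rows, cols = 3, 4
flat = [0.0] * (rows * cols)

def idx(r, c):
    # Manual row-major offset: the indexing arithmetic that a nested
    # Vec<Vec<T>> would otherwise hide.
    return r * cols + c

flat[idx(1, 2)] = 7.0
print(flat[idx(1, 2)])  # 7.0
```

Every matrix access in the 677-line file pays this small indexing tax, which is why nested Vec support is the gap that would shrink the file most.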