Porting Karpathy's MicroGPT to Hew

· 7 min read
MicroGPT: Part 1 of 3

I needed a program that would hurt. Not a toy — something with real floating-point math, string processing, mutation, nested data structures, and enough moving parts that if the compiler had gaps, the program would find them. Karpathy’s microgpt.py is 120 lines of Python that trains a character-level GPT on a list of names and generates new ones. It exercises everything: autograd, matrix math, softmax, Adam optimizer, tokenization, random sampling. If Hew can train a GPT, it can probably handle your web server.

The port took one day. The result was 1,092 lines. That’s a 9x expansion from Python, and at least a third of it was working around compiler bugs.

The tape autograd

The core abstraction in microgpt.py is a Value class with operator overloading — you write a + b and it builds a computation graph for backpropagation. Hew doesn’t have operator overloading, so the autograd uses explicit function calls: tape.vadd(a, b) instead of a + b. That’s expected expansion.

What wasn’t expected: Vec<Vec<f64>> doesn’t compile. The backend chokes on nested generics — the codegen emits a pointer type where MLIR expects a fixed-width integer. This meant no 2D arrays, no matrix-of-vectors, no clean tensor representation. Every matrix is a flat Vec<f64> with manual row-major indexing — m[row * cols + col] everywhere.
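The flat layout is easy to get wrong, so here is the indexing convention as a rough Python sketch (my illustration, not the Hew source): a naive matmul over flat, row-major buffers.

```python
def matmul(a, b, n, k, m):
    # a is an n*k matrix, b is a k*m matrix, both stored as flat
    # row-major lists: element (row, col) lives at row * cols + col.
    out = [0.0] * (n * m)
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i * k + p] * b[p * m + j]
            out[i * m + j] = s
    return out

# [[1, 2], [3, 4]] @ [[5, 6], [7, 8]]
print(matmul([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 2, 2, 2))
# → [19.0, 22.0, 43.0, 50.0]
```

The cost is that every call site has to carry the dimensions around by hand; forget one `* cols` and you read a plausible-looking wrong value instead of crashing.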

The tape itself stores its computation graph as six parallel arrays:

type Tape {
    data: Vec<f64>;      // forward values
    grad: Vec<f64>;      // gradient accumulators
    left: Vec<Int>;      // left child index (-1 = none)
    right: Vec<Int>;     // right child index (-1 = none)
    lgrad: Vec<f64>;     // local gradient for left child
    rgrad: Vec<f64>;     // local gradient for right child
}

Every “value” in the computation graph is an Int index into these parallel arrays. Operations append to the tape and record their children and local gradients. Backward walks the tape in reverse — topological sort is free because tape indices are already in creation order.

This design was forced by Vec<Vec<T>> not compiling, but it turns out to be how real ML compilers represent computation graphs anyway. (I’d like to say I planned that. I didn’t.)
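For concreteness, the scheme can be sketched in Python (a rough reconstruction, not the Hew source; the op names `vadd`/`vmul` follow the text, everything else is illustrative):

```python
class Tape:
    # Parallel-array tape: each "value" is an int index. Every op
    # appends one node recording its children and the local gradients
    # d(out)/d(left) and d(out)/d(right) at the forward values.
    def __init__(self):
        self.data, self.grad = [], []
        self.left, self.right = [], []
        self.lgrad, self.rgrad = [], []

    def _push(self, val, l, r, lg, rg):
        self.data.append(val); self.grad.append(0.0)
        self.left.append(l);   self.right.append(r)
        self.lgrad.append(lg); self.rgrad.append(rg)
        return len(self.data) - 1

    def leaf(self, val):
        return self._push(val, -1, -1, 0.0, 0.0)

    def vadd(self, a, b):
        return self._push(self.data[a] + self.data[b], a, b, 1.0, 1.0)

    def vmul(self, a, b):
        return self._push(self.data[a] * self.data[b], a, b,
                          self.data[b], self.data[a])

    def backward(self, out):
        # Reverse creation order is already a valid topological order:
        # a node's children always have smaller indices.
        self.grad[out] = 1.0
        for i in range(len(self.data) - 1, -1, -1):
            if self.left[i] >= 0:
                self.grad[self.left[i]] += self.lgrad[i] * self.grad[i]
            if self.right[i] >= 0:
                self.grad[self.right[i]] += self.rgrad[i] * self.grad[i]
```

Usage looks like `c = tape.vmul(a, b)` followed by `tape.backward(c)`, after which `tape.grad[a]` holds d(c)/d(a).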

Hand-rolling math

Hew didn’t have a std::math module yet. The language compiles to LLVM, so the hardware instructions exist — llvm.exp.f64, llvm.log.f64, llvm.sqrt.f64 — but nobody had written the thin wrappers to expose them. So I wrote math from scratch.

97 lines of Taylor series for exp, log, pow, sqrt. Here’s exp:

fn exp(x: f64) -> f64 {
    if x > 709.0 { return 1.0e308; }
    if x < 0.0 - 709.0 { return 0.0; }
    let ln2 = 0.6931471805599453;
    let n = floor(x / ln2);
    let r = x - n * ln2;
    // 16-term Taylor expansion, then scale by 2^n
    // ...
    result * power
}

Range reduction by ln 2, overflow guards, a 16-term Taylor series for the remainder. It works. It’s also slower than a single llvm.exp.f64 intrinsic and accumulates rounding errors that make the training loss drift from Python’s — negligible for this use case, but enough to make debugging harder when you’re trying to verify the port is correct.

Notice 0.0 - 709.0 instead of -709.0. Unary negation didn’t parse. Every negative number in the file is written as 0.0 - x. Fourteen times.
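The same range-reduction strategy can be written out in plain Python (an illustration, not the Hew code; Python has unary minus and a real `math.exp`, which is used here only to check the approximation):

```python
import math

def my_exp(x):
    # Split x = n*ln2 + r with r in [0, ln2), so exp(x) = 2**n * exp(r).
    # exp(r) is then a short Taylor series, which converges fast
    # because r is small.
    if x > 709.0:
        return 1.0e308      # clamp instead of overflowing to inf
    if x < -709.0:
        return 0.0          # underflow to zero
    ln2 = 0.6931471805599453
    n = math.floor(x / ln2)
    r = x - n * ln2
    term, result = 1.0, 1.0
    for k in range(1, 17):  # 16 Taylor terms: r**k / k!
        term *= r / k
        result += term
    return result * (2.0 ** n)
```

With r bounded by ln 2, sixteen terms leave a truncation error around 1e-16, so the drift the post describes comes mostly from the accumulated rounding, not the series itself.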

The MT19937 PRNG

No std::random either. The GPT needs Gaussian-distributed random numbers for weight initialization, Fisher-Yates shuffle for document ordering, and weighted random sampling for inference. Python uses the Mersenne Twister PRNG, and I needed bit-identical output to verify the port was correct.

230 lines implementing CPython’s exact MT19937 — the init_by_array seeding algorithm, the twist function, Box-Muller gaussian transform, and bisect-based weighted sampling. With seed 42, the first 10,000 random values match Python’s output exactly.
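The core generator can be sketched in Python itself as a cross-check (my re-implementation from the MT19937 reference code, not the Hew version; the assumption that CPython seeds an integer by splitting it into 32-bit words, so seed 42 becomes the key `[42]`, is what makes the outputs line up with the `random` module):

```python
import random

class MT19937:
    N, M = 624, 397
    MATRIX_A = 0x9908B0DF

    def __init__(self, seed):
        self.mt = [0] * self.N
        self.index = self.N          # force a twist on first draw
        self._init_genrand(19650218)
        key = [seed]                 # assumed CPython key for a small int seed
        i, j = 1, 0
        for _ in range(max(self.N, len(key))):   # init_by_array, pass 1
            self.mt[i] = ((self.mt[i] ^ ((self.mt[i-1] ^ (self.mt[i-1] >> 30))
                          * 1664525)) + key[j] + j) & 0xFFFFFFFF
            i += 1; j += 1
            if i >= self.N:
                self.mt[0] = self.mt[self.N - 1]; i = 1
            if j >= len(key):
                j = 0
        for _ in range(self.N - 1):              # init_by_array, pass 2
            self.mt[i] = ((self.mt[i] ^ ((self.mt[i-1] ^ (self.mt[i-1] >> 30))
                          * 1566083941)) - i) & 0xFFFFFFFF
            i += 1
            if i >= self.N:
                self.mt[0] = self.mt[self.N - 1]; i = 1
        self.mt[0] = 0x80000000

    def _init_genrand(self, s):
        self.mt[0] = s & 0xFFFFFFFF
        for i in range(1, self.N):
            self.mt[i] = (1812433253 * (self.mt[i-1] ^ (self.mt[i-1] >> 30))
                          + i) & 0xFFFFFFFF

    def _twist(self):
        for i in range(self.N):
            y = (self.mt[i] & 0x80000000) | (self.mt[(i+1) % self.N] & 0x7FFFFFFF)
            self.mt[i] = (self.mt[(i + self.M) % self.N] ^ (y >> 1)
                          ^ (self.MATRIX_A if y & 1 else 0))
        self.index = 0

    def genrand_uint32(self):
        if self.index >= self.N:
            self._twist()
        y = self.mt[self.index]; self.index += 1
        y ^= y >> 11                 # tempering
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        return y ^ (y >> 18)

    def random(self):
        # 53-bit double in [0, 1), same construction CPython uses.
        a = self.genrand_uint32() >> 5
        b = self.genrand_uint32() >> 6
        return (a * 67108864.0 + b) / 9007199254740992.0
```

Comparing `MT19937(42).genrand_uint32()` against `random.Random(42).getrandbits(32)` word by word is exactly the kind of check that made the Hew port verifiable.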

Struct mutation and the Vec hack

A separate compiler bug made the PRNG worse than it needed to be. Function parameters in Hew are immutable, and struct field assignment through function parameters doesn’t work — fn twist(mt: MersenneTwister) can’t mutate mt.index. But Vec elements are mutable through parameters, because .push() and v[i] = val go through heap pointers that bypass the immutability check. The workaround was wrapping every scalar field in a single-element Vec<Int>, turning mt.index into mt.index[0] at every access site.
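The pattern has a rough Python analogue: before `nonlocal` existed, a one-element list was the standard way to get mutation through a shared reference when plain rebinding would not work (illustrative only; `make_counter` is my own name, not from the port):

```python
def make_counter():
    # index[0] plays the role of mt.index[0]: the binding is immutable
    # from the inner function's point of view, but mutation through the
    # list cell is visible to every holder of the reference.
    index = [0]
    def bump():
        index[0] += 1
        return index[0]
    return bump, index
```

In both cases the list is pure ceremony: a heap cell standing in for a mutable field the language would not otherwise give you.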

The tokenizer bug

This one cost me two hours. The GPT trains on character tokens — each unique character in the input gets an integer ID. The tokenizer needs to build a sorted vocabulary and map characters to IDs. Simple enough.

The insertion sort on Vec<String> was producing duplicate entries and missing characters. The trained model was generating garbage because half the vocabulary mapped to the wrong characters.

Two bugs compounded. First: Vec<String>.get() returns a reference, not a copy. When the sort shifts elements rightward, it overwrites the position where the key value was read from — the key mutates under you. Fix: "" + uchars[i] to force a deep copy before shifting.

Second: Hew didn’t have break or continue. The standard workaround for breaking out of a while loop was a done flag — var done = 0; while j >= 0 && done == 0 { ... done = 1; }. But setting done = 1 inside the loop body also skipped the final decrement of j, leaving the insertion index off by one. The key went into the wrong position.

Two language gaps — no reference/copy distinction in Vec accessors, and no break statement — combined into a bug that corrupted the entire vocabulary. (The kind of bug where you stare at correct-looking sort code for an hour before realizing the language is undermining your assumptions.)
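A flag-based insertion sort that avoids both traps can be sketched in Python (the copies are unnecessary in Python, but the control flow mirrors the fixed Hew version):

```python
def insertion_sort(chars):
    a = list(chars)
    for i in range(1, len(a)):
        key = a[i]          # in Hew this needed "" + uchars[i] to force a copy,
                            # or the rightward shifts would mutate the key
        j = i - 1
        done = False
        while j >= 0 and not done:
            if a[j] > key:
                a[j + 1] = a[j]
                j -= 1      # decrement only on the shift path...
            else:
                done = True # ...so stopping leaves j pointing at the
                            # last element smaller than key
        a[j + 1] = key      # insertion index is j + 1 on both exit paths
    return a
```

The off-by-one in the port came from decrementing `j` unconditionally at the bottom of the loop, which the `done = 1` path skipped.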

First run

At 12:43pm on March 1, the initial port compiled and ran:

Loading dataset...
num docs: 32033
vocab size: 27
num params: 4192
Training for 200 steps...
step   1 / 200 | loss 3.36597
step 200 / 200 | loss 2.59736
--- inference (new, hallucinated names) ---
sample 1: jamin
sample 2: aorti
sample 3: jadin

1,092 lines of Hew. 97 lines of hand-rolled math, 230 lines of hand-rolled PRNG, and about a dozen workarounds for compiler bugs scattered through the rest. It trained, it generated plausible names, and it found more bugs in the compiler than any test suite had. The next post catalogues all thirteen of them.