No LLM has ever seen Hew code. The language didn’t exist when any of them were trained, and the corpus is too small to matter even if it had. Ask Claude or GPT to write Hew and you’ll get something that looks like Rust with actor syntax bolted on — plausible enough to fool a human, wrong enough to fail the compiler every time.
I wanted to fix that. Not for the novelty, but because I use LLMs constantly while working on the compiler, and having one that can actually produce valid Hew would save me real time. The question was whether ~1,900 training samples and a laptop-class GPU could get there.
The resulting model is on Hugging Face as sleppistan/hew-lora-qwen3.5-4b.
The setup
The hardware is an AMD Radeon 780M, the integrated GPU in a Ryzen 8845HS, with 16 GB of dedicated memory and 16 GB of system memory. Not exactly an A100, but it was on hand.
The base model is Qwen3.5-4B, a 4-billion parameter model that fits comfortably in 16 GB at bfloat16 precision. I chose it because it’s the largest model I can train on this hardware without quantization — and quantization turned out to be a non-starter anyway. (The bitsandbytes library uses the system HIP runtime, which segfaults under GPU passthrough on this particular VM. I spent an afternoon figuring that out before giving up and loading the model in bf16 on CPU, then moving it to the GPU through PyTorch’s bundled HIP. It works. Don’t ask me why.)
The training uses QLoRA — well, just LoRA since I can’t quantize. Rank 16, alpha 32, dropout 0.05, targeting all the attention and MLP projections. The adapter adds about 21 million trainable parameters on top of the frozen 4.2 billion. Each training run takes 3–4 hours.
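For a rough sense of where the 21 million figure comes from, LoRA's per-layer cost is easy to count. This is a back-of-the-envelope sketch; the 2560 dimension below is an illustrative assumption, not read from Qwen's actual config.

```python
def lora_params(in_dim: int, out_dim: int, r: int) -> int:
    """Trainable parameters LoRA adds to one frozen linear layer:
    matrix A is (in_dim x r), matrix B is (r x out_dim)."""
    return r * (in_dim + out_dim)

# e.g. a square 2560x2560 attention projection at rank 16
# (2560 is an assumed hidden size, purely for illustration):
print(lora_params(2560, 2560, 16))  # prints 81920, roughly 82k
```

Summed over the seven targeted projections per layer and a few dozen layers, per-projection counts like this add up to the tens of millions of trainable parameters reported above, while the 4-billion-parameter base stays frozen.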
The data
The initial training corpus came from a few sources: the Hew language documentation, example programs from the compiler test suite, synthetic examples generated by Claude (with me fixing every one that didn’t compile), and hand-written examples covering specific language features.
Every sample follows ChatML format with a system prompt that encodes the key Hew rules — use Int not i32, extern calls need unsafe, actor sends move values, generators use gen fn, no range patterns in match. The system prompt evolved with each training round as I discovered what the model kept getting wrong.
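To make the sample format concrete, here is a sketch of how one ChatML sample might be assembled. The system-prompt text, helper name, and example instruction are illustrative stand-ins, not the actual prompt or corpus.

```python
# Illustrative system prompt encoding the key Hew rules described above.
SYSTEM_RULES = (
    "You write Hew code. Rules: use Int not i32; extern calls need unsafe; "
    "actor sends move values; generators use gen fn; "
    "no range patterns in match."
)

def chatml_sample(instruction: str, code: str) -> str:
    """Assemble one training sample in ChatML chat format."""
    return (
        f"<|im_start|>system\n{SYSTEM_RULES}<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{code}<|im_end|>\n"
    )

sample = chatml_sample(
    "Write a generator that yields the first n integers.",
    "gen fn count(n: Int) -> Int { ... }",  # placeholder, not real corpus code
)
```

Because the same SYSTEM_RULES string rides along with every sample, the model sees each rule stated explicitly thousands of times over a training run.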
The critical piece: every code sample in the training data is validated by the Hew compiler before it goes in. hew check runs on every example, and if it doesn’t pass, it doesn’t ship. This sounds obvious but it’s the entire strategy — the compiler is the test suite.
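A minimal sketch of that gate, assuming `hew check` takes a file path and signals failure through its exit code. The `checker` parameter is injectable purely so the filtering logic can be exercised without the real compiler; the sample shape is an assumed dict with a `"code"` key.

```python
import os
import subprocess
import tempfile

def hew_check(source: str) -> bool:
    """Write the sample to a temp file and ask the compiler to validate it."""
    with tempfile.NamedTemporaryFile("w", suffix=".hew", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(["hew", "check", path], capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

def validated(samples: list[dict], checker=hew_check) -> list[dict]:
    """Keep only samples whose code compiles; everything else is dropped."""
    return [s for s in samples if checker(s["code"])]
```

If it doesn't pass, it doesn't ship: rejected samples never reach the corpus, so the compiler, not a human reviewer, is the gatekeeper for data quality.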
The loop
The training pipeline is iterative:
- Train a LoRA adapter
- Convert it to GGUF format for llama.cpp
- Load it into llama-server
- Run 39 eval prompts across 12 categories — actors, supervisors, generators, wire types, state machines, algorithms, concurrency, data structures, pattern matching, extern FFI, tests, and real-world patterns
- Send each generated response through hew check
- Categorize the failures
- Write targeted correction examples for each failure pattern
- Go back to step 1
Each eval prompt asks the model to write a complete Hew program. The response gets stripped of markdown fences and think tags, written to a temp file, and fed to the compiler. Pass or fail, that’s it. No human judgment about code quality — just “does it compile.”
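The extraction step might look like this sketch; the think-tag and fence conventions are assumptions about the model's output format, not the actual eval script.

```python
import re

FENCE = "`" * 3  # a literal markdown fence, built programmatically

def extract_code(response: str) -> str:
    """Strip chain-of-thought tags and markdown fences from a model reply,
    leaving only the code to hand to the compiler."""
    # Drop chain-of-thought blocks like <think>...</think>.
    text = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # If the reply wraps the program in a fenced code block, keep only
    # the fenced contents; otherwise keep the whole response.
    m = re.search(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, text, flags=re.DOTALL)
    return (m.group(1) if m else text).strip()

reply = f"<think>plan...</think>Here you go:\n{FENCE}hew\nfn main() {{}}\n{FENCE}\n"
print(extract_code(reply))  # prints: fn main() {}
```

Whatever survives extraction goes to the compiler as-is, so prose that leaks outside any fence still reaches `hew check` and fails there.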
What the model gets wrong
The failure patterns are remarkably consistent across runs. The model has strong priors from its pretraining on Rust, Go, and other languages, and those priors are wrong for Hew in specific, predictable ways.
Invented APIs. The model confidently calls .iter(), .sum(), .insert(), and .contains() on Vec — methods that exist in Rust but not in Hew. It invents std::os::args() for command-line argument parsing. It calls Vec::new() instead of using [] literals. Each of these requires correction examples showing the Hew way: for loops for iteration, push/pop/len and indexing for vectors, no CLI args API.
Type coercion. The model reaches for i32 because that’s the default integer type in most languages. In Hew, Int is i64, and integer literals are i64 by default — writing fn foo() -> i32 { 42 } fails because the literal 42 is an i64 and Hew won’t implicitly coerce it down.
Generator syntax. The model writes fn fibonacci() -> Int with yield inside, which fails because yield is only valid inside gen fn. The fix is always the same — gen fn fibonacci() -> Int — but the model has to unlearn the intuition that fn is fn.
State machine transitions. This one took a few rounds to fix. The model kept writing on GoGreen: from Red -> Green { "green light" } with a spurious from keyword. The correct syntax is on GoGreen: Red -> Green { Green } — the body must return the target state, not a string.
Enum construction. Hew uses struct-like braces for data-carrying enum variants: JsonValue::Bool { val: 1 }. The model writes JsonValue::Bool(1) because that’s how Rust does it. The definition syntax is different too — Bool { val: Int; }; not Bool(Int).
Move semantics. Actor message sends move values, so actor.send(val); other_actor.send(val) is a use-after-move error. The fix is let v1 = val; let v2 = val; before the sends. The model doesn’t anticipate this.
Prose leaking. Even with a system prompt that says “Output ONLY the code,” the model sometimes produces an explanation paragraph before the code block. The eval strips markdown fences now, but for a while those explanation paragraphs were silently causing every “unexpected character at line 1” failure.
The results
Six training runs, v7 through v12, over about a week:
| Version | Samples | Config | Pass Rate |
|---|---|---|---|
| v7 | 1,827 | 1 epoch, r=16 | 53% |
| v8 | 1,866 | 1 epoch, r=16 | 69% |
| v9 | 1,898 | 1 epoch, r=16 | 68% |
| v10 | 1,926 | 1 epoch, r=16 | 67% |
| v11 | 1,926 | 2 epochs, r=16 | 80% |
| v12 | 1,926 | 1 epoch, r=32 | 71% |
v8 was a big jump — that’s when I added corrections for state machines, supervisors, tests, and extern blocks. v9 and v10 added more targeted corrections for generators, invented APIs, move semantics, and enum syntax, but the pass rate barely moved; it actually dipped slightly.
The breakthrough was v11: same corpus as v10, but two training epochs instead of one. The pass rate jumped from 67% to 80%. The training loss dropped from 0.25 to 0.16 and token accuracy hit 98.1%.
v12 doubled the LoRA rank to 32, which doubled the adapter size to 163 MB, and performed worse than v11. More capacity didn’t help — more training did.
What I learned
The compiler is the only eval that matters. I could have spent time building a more nuanced evaluation — code style, idiomatic patterns, documentation quality. None of that matters if the code doesn’t compile. A binary pass/fail from the compiler is the sharpest possible signal for training data quality.
Correction examples are high-leverage early, then plateau. Going from v7 to v8, adding 39 targeted corrections for specific failure patterns produced a 16-point improvement. Going from v8 to v10, adding another 60 corrections across two rounds produced essentially nothing. The model had already absorbed the corrections it was going to absorb in one epoch.
More epochs beat more data at this scale. With ~1,900 samples, the model hadn’t fully memorized the corpus after one pass. A second epoch let it reinforce the patterns — especially the correction examples that contradict its pretraining. I suspect this stops being true at larger corpus sizes, but for a small, high-quality dataset, one epoch isn’t enough.
The system prompt is training data too. The system prompt in each training sample encodes the rules: “use Int not i32”, “no .iter() on Vec”, “generators use gen fn.” As I discovered new failure patterns, I added them to the system prompt for new samples. The model learns these rules partly from the code examples and partly from seeing them stated explicitly hundreds of times.
Pretraining priors are the enemy and the asset. The model writes plausible-looking Hew because it knows Rust. It writes wrong Hew because it knows Rust. Every correction is fighting a prior. The ones that are closest to Rust (like enum construction syntax) are the hardest to fix because the model has seen the Rust pattern millions of times and my correction example maybe a dozen.
The 20% that still fails
The remaining failures fall into three buckets. Type errors where the model reaches for APIs or patterns that Hew doesn’t support — these are the long tail of “it knows Rust, not Hew.” Move semantics errors where the model doesn’t anticipate Hew’s ownership rules for actor sends. And a handful of parse errors from the model still occasionally outputting prose instead of code.
80% compiler pass rate from a 4B model that has never seen the language in pretraining, trained on under 2,000 samples on integrated graphics — I don’t know if that’s good. I know it’s useful. The model generates valid scaffolding for actors, supervisors, state machines, and generators most of the time now, and that’s enough to speed up my own work on the compiler and standard library.
Whether it’s worth pushing further, I’m not sure yet. The next 10% probably requires either a larger base model or a fundamentally larger corpus — maybe scraping every program that’s been written in Hew and deduplicating against the training set. The last 10% might require the model to actually understand Hew’s type system, which 1,900 examples probably can’t teach.