Eight bytes. That’s how much the reply channel was leaking per hew_actor_ask call — a single pointer-sized allocation, loaded by codegen, never freed. No test caught it, and no program crashed. You could run actor programs for hours and never notice. But multiply eight bytes by every ask call in a long-running service, and you’ve got the kind of leak that shows up six months later in a production OOM report with no stack trace.
I’d been telling myself that if the tests pass and the programs run, the memory is probably fine. (I’ve thought this about a lot of projects. It’s rarely true.)
The setup
The setup was straightforward: compile twelve representative programs spanning the major runtime features — actors, supervisors, generators, closures, lambda actors, stress tests — and run each under valgrind with --leak-check=full. First problem: valgrind couldn’t produce useful symbol traces because the linker was passing --strip-all. Removed that flag temporarily, got readable traces, and the picture was not what I expected.
The clean side
Non-actor programs came back perfect. Zero leaks, every allocation freed. The arena allocator, Vec free, HashMap free, string drop — all clean. Those codegen paths had been built incrementally across dozens of commits, and valgrind confirmed every malloc had a matching free. I would have bet the actor side was cleaner. (I would have lost that bet.)
Eight bytes at a time
The reply channel leak was the simplest fix and the most instructive. hew_actor_ask allocates a buffer, the runtime writes the reply into it, codegen loads the reply value out. Nobody freed the buffer. The generated MLIR loaded the value and moved on. Fix was a single free(replyPtr) in ActorAskOpLowering after the load. Eight bytes per ask call, invisible to every test, guaranteed to accumulate.
Supervisor child specs were a similar pattern. hew_supervisor_add_child_spec uses strdup for the child name and malloc for the init state copy — standard C allocation for data crossing the FFI boundary. On shutdown, those allocations just… weren’t freed. The supervisor’s Box would drop, the child spec structs would drop, and the C-allocated memory would leak. Fix was a Drop impl:
```rust
impl Drop for InternalChildSpec {
    fn drop(&mut self) {
        if !self.name.is_null() {
            // SAFETY: name was allocated by libc::strdup in
            // hew_supervisor_add_child_spec and is non-null.
            unsafe { libc::free(self.name.cast()); }
        }
        if !self.init_state.is_null() {
            // SAFETY: init_state was allocated by libc::malloc
            // and memcpy'd in hew_supervisor_add_child_spec.
            unsafe { libc::free(self.init_state); }
        }
    }
}
```

Straightforward RAII — when the supervisor’s Box drops, the child specs drop, the C allocations free. No shutdown hook, no manual cleanup, no chance of forgetting.
Shutting down the world
Actor system shutdown was where things got complicated. Individual actor leaks are local problems — find the allocation, find where it should be freed, add the free. System-wide shutdown is coordination. Worker threads need to finish. Actors still in mailbox queues need their state freed. The scheduler itself needs to be deallocated.
The fix introduced a LIVE_ACTORS tracking set with track_actor() and untrack_actor() calls at spawn and drop. A new cleanup_all_actors() function runs after the worker threads are joined, walking the set and freeing any remaining actor state. Then hew_runtime_cleanup() — a C ABI function that codegen calls after hew_sched_shutdown() — tears down the scheduler itself. The SCHEDULER changed from OnceLock to AtomicPtr so it could actually be freed, since OnceLock is designed to be set once and never cleared.
The nested supervisor
This was the hardest bug, and it started as a double-free. (That one happened on a Thursday night, which tells you how the Friday morning went.)
The first attempt at supervisor cleanup called free_registered_supervisors(), which tried to stop each supervisor gracefully. But by the time this ran, the worker threads were already joined — there was nobody to process the stop messages. The supervisor’s spin-wait loop would deadlock, waiting for an acknowledgment that could never arrive.
So I changed the approach: just drop the Box directly, skip the graceful shutdown. That’s when the double-free appeared. The inner supervisor was also registered as a top-level supervisor, because hew_supervisor_start auto-registers when the parent pointer is null. And the parent pointer is null at start time — it only gets set when the outer supervisor calls add_child_supervisor, which happens after the child’s init function has already run and registered it.
The lifecycle is: new → start (registers as top-level) → add_child_supervisor (sets parent). Steps two and three are out of order by design — you can’t add a child until it’s started, but starting it registers it as parentless. Whether that design is right, I’m honestly not sure, but the fix was to unregister child supervisors from the top-level list inside add_child_supervisor_with_init, and to add a free_supervisor_resources() that drops supervisor memory without spin-waiting for acknowledgment. A null-out of self_actor.state before freeing prevented a second bug where libc::free would be called on what was actually a Rust Box.
This also fixed a pre-existing failure in the e2e_supervisor_nested test — a test that had been failing for reasons nobody had traced. Turns out the reason was the same lifecycle ordering issue, manifesting as a hang instead of a crash.
Making valgrind stick
I added scripts/valgrind-check.sh, which compiles seven programs and checks that valgrind reports zero definitely-lost bytes. One known limitation: closure environments in non-actor main() functions leak 128 bytes because the environment struct is heap-allocated but never freed when main exits. That’s a codegen fix for a future commit. The script documents it as a known gap.
The “possibly lost” and “still reachable” categories remain non-zero — crossbeam’s epoch-based reclamation uses thread-local data that valgrind can’t trace through, and thread-local storage is inherently “still reachable” at exit. These are known false positives from the lock-free queue implementation, not actual leaks.
```
== HEAP SUMMARY:
==     in use at exit: 3,088 bytes in 12 blocks
==   total heap usage: 1,247 allocs, 1,235 frees, 198,304 bytes allocated
==
== LEAK SUMMARY:
==    definitely lost: 1,824 bytes in 9 blocks
==    indirectly lost: 261 bytes in 3 blocks
==      possibly lost: 1,296 bytes in 3 blocks (crossbeam)
==    still reachable: 2,064 bytes in 6 blocks (thread-local)
```

became:
```
== HEAP SUMMARY:
==     in use at exit: 1,296 bytes in 3 blocks
==   total heap usage: 1,253 allocs, 1,250 frees, 199,472 bytes allocated
==
== LEAK SUMMARY:
==    definitely lost: 0 bytes in 0 blocks
==    indirectly lost: 0 bytes in 0 blocks
==      possibly lost: 1,296 bytes in 3 blocks (crossbeam)
==    still reachable: 2,064 bytes in 6 blocks (thread-local)
```

1,824 bytes definitely lost, 261 indirectly lost, one test failure. Then zero, zero, zero. I don’t know if that makes the runtime trustworthy yet, but at least valgrind stopped yelling at me.