Wire Types and Distributed Actors

· 11 min read
v0.1.9

Local actors are straightforward — single process, shared memory, message delivery is a pointer swap into a mailbox. The hard part is when the actor you need is on a different machine, in a different failure domain. That’s what distributed actors need to solve, and it’s been the scariest item on the roadmap since day one.

The feat/distributed-actors-v2 branch has the work. There was a v1 attempt that I won’t describe in detail because it was bad. (Bad in the “let’s pretend the network is reliable” way, which is a mistake you only get to make once before you’ve read Deutsch’s fallacies of distributed computing and felt embarrassed.) The v2 design started from the assumption that remote message delivery can fail, and that the type system should make that visible.

Wire types

The foundation of everything distributed in Hew is the wire type system. A wire type is a type the compiler knows how to serialize across a network boundary — not just copy across actor mailboxes within a process, but encode into bytes, send over TCP, and decode on the other side.

wire type NodeId {
    host: string,
    port: i32,
    instance: u64,
}

wire type SensorReading {
    sensor_id: string,
    value: f64,
    timestamp: i64,
}

wire type Command {
    Shutdown,
    Restart { delay: i64 },
    UpdateConfig { key: string, value: string },
}

The wire keyword does two things. First, it constrains every field to be wire-compatible — you can’t put a raw pointer, a closure, or a file handle inside a wire type. The compiler rejects it. Second, it generates serialization code. Every wire type gets both a binary serializer (MessagePack — a compact binary serialization format, and the same one the compiler’s own IR pipeline uses) and a JSON serializer, automatically.

The constraint checking is recursive. If SensorReading contains a string, the compiler verifies that string is wire-safe. It is — strings are just byte sequences. If you tried to put a &mut Vec<i32> in there, the compiler would reject it because mutable references aren’t wire-safe. Same logic as the existing Send constraint for local actor boundaries, extended to cover serialization.

wire type BadIdea {
    callback: fn(i32) -> i32,  // ERROR: closures are not wire-safe
}
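The recursive check can be pictured as a walk over a type descriptor tree. Here is a minimal sketch in C; the type kinds and the `is_wire_safe` helper are invented for illustration, not Hew's actual compiler internals:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type descriptors, invented for illustration. */
typedef enum { TY_I32, TY_F64, TY_STRING, TY_PTR, TY_FN, TY_STRUCT } TyKind;

typedef struct Ty {
    TyKind kind;
    const struct Ty* fields;  /* field types, for TY_STRUCT */
    size_t num_fields;
} Ty;

/* A type is wire-safe if it is a plain value type, or a struct whose
   fields are all (recursively) wire-safe. Pointers and functions fail. */
static bool is_wire_safe(const Ty* t) {
    switch (t->kind) {
    case TY_I32: case TY_F64: case TY_STRING:
        return true;
    case TY_PTR: case TY_FN:
        return false;
    case TY_STRUCT:
        for (size_t i = 0; i < t->num_fields; i++)
            if (!is_wire_safe(&t->fields[i]))
                return false;
        return true;
    }
    return false;
}
```

A `SensorReading`-shaped struct of string/f64/i32 passes the walk; any struct containing a function type fails, no matter how deeply nested.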

Wire enums work the same way — every variant’s payload must be wire-safe. The serializer tags variants with a discriminant byte, so the deserializer on the other end knows which variant it’s reconstructing.
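To make the discriminant-byte scheme concrete, here is a sketch of what the generated code might do for two `Command` variants. The tag values and the host-order i64 payload layout are invented for illustration; the real generated serializer encodes payloads as MessagePack:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical discriminant values for the Command wire enum. */
enum { CMD_SHUTDOWN = 0, CMD_RESTART = 1, CMD_UPDATE_CONFIG = 2 };

/* Encode Restart { delay } as one tag byte plus a host-order i64 payload. */
static size_t encode_restart(uint8_t* buf, int64_t delay) {
    buf[0] = CMD_RESTART;
    memcpy(buf + 1, &delay, sizeof delay);
    return 1 + sizeof delay;
}

/* Decode: read the tag first, then reconstruct that variant's payload. */
static int decode_command(const uint8_t* buf, int64_t* delay_out) {
    switch (buf[0]) {
    case CMD_RESTART:
        memcpy(delay_out, buf + 1, sizeof *delay_out);
        return CMD_RESTART;
    case CMD_SHUTDOWN:
        return CMD_SHUTDOWN;  /* no payload to read */
    default:
        return -1;            /* unknown discriminant */
    }
}
```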

Serialization: two formats, one interface

Binary serialization uses MessagePack. I already had a MessagePack implementation for the compiler’s IR transport between the Rust frontend and C++ MLIR backend, so extending it to user-defined wire types was mostly plumbing. Integers pack in 1-9 bytes depending on magnitude, strings are length-prefixed UTF-8, structs serialize as MessagePack arrays in field order. No schema negotiation, no versioning — the compiler guarantees both sides agree on the layout because they were compiled from the same wire type definition.
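The 1-to-9-byte integer packing follows directly from the MessagePack spec: the smallest encoding that fits wins, and multi-byte values are big-endian. A standalone sketch of the unsigned-integer case:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack an unsigned integer per the MessagePack spec: the smallest
   representation wins, from a 1-byte positive fixint up to uint64.
   Multi-byte values are written big-endian, as the spec requires. */
static size_t msgpack_pack_uint(uint8_t* out, uint64_t v) {
    if (v < 0x80) {                  /* positive fixint: 1 byte total */
        out[0] = (uint8_t)v;
        return 1;
    } else if (v <= UINT8_MAX) {     /* 0xcc marker + 1 byte */
        out[0] = 0xcc; out[1] = (uint8_t)v;
        return 2;
    } else if (v <= UINT16_MAX) {    /* 0xcd marker + 2 bytes */
        out[0] = 0xcd;
        out[1] = (uint8_t)(v >> 8); out[2] = (uint8_t)v;
        return 3;
    } else if (v <= UINT32_MAX) {    /* 0xce marker + 4 bytes */
        out[0] = 0xce;
        for (int i = 0; i < 4; i++) out[1 + i] = (uint8_t)(v >> (24 - 8 * i));
        return 5;
    } else {                         /* 0xcf marker + 8 bytes */
        out[0] = 0xcf;
        for (int i = 0; i < 8; i++) out[1 + i] = (uint8_t)(v >> (56 - 8 * i));
        return 9;
    }
}
```

A sensor id fits in one byte, a port in three, and a millisecond timestamp like the one in `SensorReading` takes the full nine.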

JSON serialization exists because debugging binary protocols at 2 AM with a hex dump is nobody’s idea of a good time. Every wire type can round-trip through JSON:

import std::encoding::json

let reading = SensorReading {
    sensor_id: "temp-north-3",
    value: 22.5,
    timestamp: 1709827200000,
};

let encoded = json::encode(reading);
// {"sensor_id":"temp-north-3","value":22.5,"timestamp":1709827200000}

let decoded: SensorReading = json::decode(encoded);

The JSON serializer is generated at compile time alongside the binary one. No reflection, no runtime schema — just a pair of functions baked into the binary for each wire type. The binary format is what the runtime actually uses for inter-node messaging. JSON is for logging, debugging, and external API boundaries.

The 7-phase implementation

Distributed actors v2 was designed as seven implementation phases, because the alternative was trying to do everything at once and shipping nothing. The phases, in order:

Phase 1: Wire type checking and codegen. Make the compiler understand wire type, validate constraints, generate serializers. This landed first and works independently of everything else — you can use wire types for file I/O or API serialization without any distributed runtime.

Phase 2: Node identity and addressing. Every node in a cluster gets a NodeId — host, port, and a random instance identifier that distinguishes restarts at the same address. Actor addresses become (NodeId, ActorId) pairs. Local actors still use plain ActorId; the distributed runtime wraps them transparently.
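The reason the instance field matters is restart disambiguation: the same host:port can come back as a logically different node. A sketch of the comparison, with the C-side mirror of `NodeId` and both helpers invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C-side mirror of the NodeId wire type. The random
   instance field distinguishes a restarted node from its previous
   incarnation at the same host:port. */
typedef struct {
    char host[64];
    int32_t port;
    uint64_t instance;
} node_id;

static bool node_id_eq(const node_id* a, const node_id* b) {
    return a->port == b->port
        && a->instance == b->instance
        && strcmp(a->host, b->host) == 0;
}

/* Same address, different incarnation: anything bound to the old
   instance must not silently attach to the restarted node. */
static bool same_address(const node_id* a, const node_id* b) {
    return a->port == b->port && strcmp(a->host, b->host) == 0;
}
```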

Phase 3: The I/O poller. The runtime needed non-blocking network I/O, and I didn’t want to pull in an async runtime as a dependency. So I wrote one. kqueue on macOS, epoll on Linux — the OS-level I/O notification APIs. The poller runs on a dedicated thread, monitors TCP connections, and dispatches readiness events to the scheduler. It’s about 800 lines of C with #ifdef branches for the two backends.

typedef struct hew_poller {
    int fd;                    // kqueue fd or epoll fd
    hew_poll_event* events;    // readiness buffer
    int max_events;
    int num_connections;
} hew_poller;

int hew_poller_create(hew_poller* poller, int max_events) {
#if defined(__APPLE__)
    poller->fd = kqueue();
#elif defined(__linux__)
    poller->fd = epoll_create1(EPOLL_CLOEXEC);
#endif
    if (poller->fd < 0) {
        return -1;
    }
    poller->events = calloc(max_events, sizeof(hew_poll_event));
    if (poller->events == NULL) {
        close(poller->fd);  // don't leak the kqueue/epoll fd on failure
        return -1;
    }
    poller->max_events = max_events;
    poller->num_connections = 0;
    return 0;
}

(Writing cross-platform I/O polling is one of those things that sounds straightforward until you discover that kqueue and epoll have subtly different edge-triggered semantics and your connection handler works perfectly on macOS and drops every third message on Linux.)
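The wait-and-dispatch half of the loop has the same two-backend shape. A standalone sketch, with the function names and the fixed-size event buffer invented for illustration:

```c
#include <assert.h>
#include <unistd.h>
#if defined(__APPLE__)
#include <sys/event.h>
#else
#include <sys/epoll.h>
#endif

/* Register a file descriptor for read-readiness notifications. */
static int poll_add_read(int pfd, int fd) {
#if defined(__APPLE__)
    struct kevent ev;
    EV_SET(&ev, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    return kevent(pfd, &ev, 1, NULL, 0, NULL);
#else
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(pfd, EPOLL_CTL_ADD, fd, &ev);
#endif
}

/* Block until at least one monitored fd is readable, then report
   which fds fired so the scheduler can dispatch them. */
static int poll_wait(int pfd, int* ready_fds, int max) {
#if defined(__APPLE__)
    struct kevent evs[16];
    int n = kevent(pfd, NULL, 0, evs, max < 16 ? max : 16, NULL);
    for (int i = 0; i < n; i++) ready_fds[i] = (int)evs[i].ident;
    return n;
#else
    struct epoll_event evs[16];
    int n = epoll_wait(pfd, evs, max < 16 ? max : 16, -1);
    for (int i = 0; i < n; i++) ready_fds[i] = evs[i].data.fd;
    return n;
#endif
}
```

Level-triggered mode (the default on both backends, used here) sidesteps most of the edge-trigger divergence mentioned above, at the cost of re-reporting fds you haven't drained.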

Phase 4: Cluster formation and node discovery. Nodes find each other through a seed list — you start a node with --seeds 10.0.1.2:9100,10.0.1.3:9100 and it connects to those addresses, exchanges node metadata, and builds a membership table. No consensus protocol — no Raft or Paxos (distributed agreement algorithms). Just a gossip protocol that propagates membership changes, with crash detection via TCP keepalives. This is deliberately simple. Partition tolerance and strong consistency are future problems — right now I needed nodes that can find each other and exchange messages.

// Starting a distributed Hew program
// Node A
fn main() {
    let cluster = Cluster::join(
        bind: "0.0.0.0:9100",
        seeds: ["10.0.1.2:9100"],
    );

    let worker = spawn Worker(config) on cluster.any_node();
    worker.process(task);
}
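The core of gossip-based membership is the merge rule applied when one node's view of the table meets another's. A sketch of a common heartbeat-based rule, not Hew's actual implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Each node entry carries a heartbeat counter bumped only by its
   owner; a merge keeps whichever copy has the newer heartbeat. */
typedef struct {
    char addr[32];       /* "host:port" */
    uint64_t heartbeat;  /* monotonically increasing, owner-bumped */
    int alive;           /* 0 once the failure detector marks it dead */
} member;

/* Merge one gossiped entry into a local table of n members.
   Returns 1 if the local view changed, 0 otherwise. */
static int merge_member(member* table, int n, const member* incoming) {
    for (int i = 0; i < n; i++) {
        if (strcmp(table[i].addr, incoming->addr) == 0) {
            if (incoming->heartbeat > table[i].heartbeat) {
                table[i] = *incoming;  /* newer information wins */
                return 1;
            }
            return 0;  /* stale gossip, ignore */
        }
    }
    return 0;  /* unknown node: a real table would append here */
}
```

The eventual-consistency cost the post describes later falls straight out of this rule: a death only propagates as fast as fresh entries happen to be gossiped around.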

Phase 5: Remote spawn. spawn Actor(args) on node creates an actor on a remote node. The arguments must be wire-safe — the compiler checks this at the call site. The runtime serializes the arguments, sends them to the target node, which deserializes and spawns locally. The caller gets back a remote actor handle that looks identical to a local one. Message sends through the handle are serialized and dispatched over TCP.

wire type WorkerConfig {
    batch_size: i32,
    timeout: i64,
}

actor Worker {
    config: WorkerConfig,

    receive fn process(task: Task) -> Result {
        // runs on the remote node
        do_work(task)
    }
}

// Spawn on a specific node
let node = cluster.node("worker-pool-1");
let w = spawn Worker(config) on node;

// This message crosses the network transparently
let result = w.process(my_task);

The on keyword after spawn triggers a completely different code path in the compiler. Instead of allocating an actor struct and pushing it to the local scheduler, it emits a remote spawn request that serializes the wire-safe constructor arguments, opens or reuses a TCP connection to the target node, and waits for an acknowledgment containing the remote actor’s address.

Phase 6: Remote messaging. Once you have a remote actor handle, message sends need to serialize arguments, send them over the connection, and deserialize the response. The receive function’s parameter types must all be wire-safe. The compiler enforces this — if you declare receive fn process(task: Task) and Task isn’t a wire type, the program won’t compile when used across node boundaries.
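For a serialized message to survive a TCP stream, the receiver has to know where one message ends and the next begins. A sketch of the standard fix, a length-prefixed frame; the 4-byte big-endian prefix here is illustrative, not Hew's actual wire format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Frame: a 4-byte big-endian length prefix, then the payload bytes
   (e.g. a MessagePack-encoded message). Returns total frame size. */
static size_t frame_write(uint8_t* out, const uint8_t* payload, uint32_t len) {
    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
    memcpy(out + 4, payload, len);
    return 4 + (size_t)len;
}

/* Returns the payload length and points *payload at its start, so the
   receiver knows exactly how many bytes belong to this message. */
static uint32_t frame_read(const uint8_t* in, const uint8_t** payload) {
    uint32_t len = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
                 | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    *payload = in + 4;
    return len;
}
```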

Phase 7: Failure detection and supervision across nodes. This is the phase that’s still unfinished. Local supervision trees work — a supervisor restarts crashed children following its configured strategy. But when a child actor is on a remote node and that node goes down, the supervisor needs to know. The gossip protocol detects node departure, but wiring that into the supervision tree’s restart logic is incomplete. Right now a remote actor crash is detected but the supervisor just logs it instead of restarting.

What worked

The wire type system was the right foundation. Having the compiler verify serialization safety at compile time eliminates an entire category of runtime errors — the “I tried to serialize a closure and got garbage” class of bugs that plagues dynamically-typed actor systems. Erlang handles this at the VM level with term_to_binary; Hew handles it before the code runs.

Remote spawn with transparent handles worked better than I expected. The fact that w.process(task) compiles to the same syntax whether w is local or remote means you can move actors between nodes without changing calling code. The runtime dispatches appropriately based on whether the handle points to a local or remote actor. And the I/O poller underneath it all turned out surprisingly solid for 800 lines of platform-specific C with no async runtime dependency.

What didn’t

Gossip-based membership is too eventual for supervision. When a node crashes, it takes several gossip rounds before the departure propagates — seconds, not milliseconds. For supervision decisions, you want to know immediately. I’ll probably need direct TCP health checks between supervisors and their remote children, bypassing the gossip protocol entirely.
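Even before dedicated health-check traffic exists, the kernel's keepalive timers on the supervisor-to-child connection can be tightened per socket. A sketch using the standard socket options; the specific timing values are illustrative:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Tighten TCP keepalive on one connection so a dead peer surfaces in
   seconds rather than the kernel's default of hours. Values here
   (2s idle, 1s probe interval, 3 failed probes) are illustrative. */
static int enable_fast_keepalive(int fd) {
    int on = 1, idle = 2, interval = 1, count = 3;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
#if defined(__APPLE__)
    /* macOS exposes only the idle time, as TCP_KEEPALIVE. */
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPALIVE, &idle, sizeof idle);
#else
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof interval) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count);
#endif
}
```

This only detects dead TCP peers, not a wedged actor on a live node, so it complements rather than replaces application-level health checks.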

The JSON serializer generates more code than I’d like. Every wire type gets both binary and JSON serialization functions compiled into the binary, even if the program never uses JSON. Dead code elimination should handle this eventually, but right now it bloats binary size noticeably for programs with many wire types.

Phase 7 is the gap that matters. Distributed actors without distributed supervision means the easy failures are covered but the hard ones aren’t. A remote actor that crashes gets detected, but the restart happens… nowhere. The supervisor sees the failure, logs it, and moves on. That’s not acceptable for production use, and it’s the next thing I’m working on.

What’s still unfinished

Distributed supervision restart logic is the big one — that’s what makes this production-ready or not. After that, backpressure across network boundaries, because local mailbox overflow policies don’t translate cleanly when the producer is on a different machine. And wire type versioning for rolling deployments, where different nodes might be running different versions of the same wire type definition.

The branch has 30-odd commits and is going to need a serious cleanup before it’s ready to merge. Some of the early phase 4 code assumed reliable ordered delivery from the gossip layer, which is exactly the mistake I said I wouldn’t make. It mostly works because TCP is mostly ordered, but that’s not good enough for production.