Benchmarks
This document describes the benchmark methodology, how to run each benchmark, and the target performance numbers.
Overview
ZigBolt ships a comprehensive benchmark suite covering latency, throughput, codec performance, and data structure operations:
| Benchmark | Binary | What it Measures |
|---|---|---|
| Ping-Pong | bench_ping_pong | IPC round-trip latency (RTT) |
| Throughput | bench_throughput | IPC single-direction message rate |
| UDP RTT | bench_udp_rtt | UDP loopback round-trip latency |
| Codec Throughput | bench_codec_throughput | WireCodec encode/decode rate (single + batch) |
| SPSC Latency | bench_spsc_latency | SPSC ring buffer write/read latency |
| MPSC Latency | bench_mpsc_latency | MPSC ring buffer contention latency |
| LogBuffer Throughput | bench_logbuffer | LogBuffer claim/commit/read rate |
| IPC Multi-Size | bench_ipc_multisize | IPC latency across message sizes |
| Full Suite | bench_run_all | All-in-one suite with JSON output |
All benchmarks are compiled with -OReleaseFast for maximum optimization.
Building
To build and run the full suite (bench/run_all.zig):

    zig build bench

A plain zig build compiles all individual benchmark binaries into zig-out/bin/.

To run individually:

    zig build && ./zig-out/bin/bench_ping_pong
    zig build && ./zig-out/bin/bench_throughput
    zig build && ./zig-out/bin/bench_udp_rtt
    zig build && ./zig-out/bin/bench_codec_throughput
    zig build && ./zig-out/bin/bench_spsc_latency
    zig build && ./zig-out/bin/bench_mpsc_latency
    zig build && ./zig-out/bin/bench_logbuffer
    zig build && ./zig-out/bin/bench_ipc_multisize
    zig build && ./zig-out/bin/bench_run_all

The bench_run_all binary runs all benchmarks in sequence and outputs a summary table plus a bench/results.json file for CI integration.
Methodology
Ping-Pong (IPC RTT)
What: Measures the time between publishing a message into an IPC channel and immediately polling it back in the same process. This captures the raw shared memory write/read latency without cross-process scheduling overhead.
Procedure:
- Create an IPC channel (/zigbolt_bench_pp) with 1 MB term length, pre-faulted pages
- Warm up with 10,000 messages (discarded)
- Recreate the channel to start clean
- For each of 100,000 measurement iterations:
  - Record send_time via timestampNs()
  - Publish a 32-byte message containing the timestamp
  - Poll it back immediately
  - Record recv_time, compute rtt = recv_time - send_time
  - Add RTT to HDR histogram
- Report percentiles: min, mean, p50, p90, p99, p99.9, p99.99, max
Configuration:
- Message size: 32 bytes
- Term length: 1 MB (1,048,576 bytes)
- Warmup: 10,000 messages
- Measurement: 100,000 messages
- Pre-fault: enabled
Target:
- p50 < 200 ns
- p99 < 1,000 ns
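The procedure above can be sketched in a few lines. This is an illustrative Python model, not the ZigBolt implementation: a `collections.deque` stands in for the shared-memory IPC channel, `time.perf_counter_ns()` stands in for `timestampNs()`, and the names `run_ping_pong`, `percentile`, and `measure` are hypothetical.

```python
import math
import time
from collections import deque

WARMUP = 10_000
ITERATIONS = 100_000
MSG_SIZE = 32


def run_ping_pong(channel, iterations):
    """Publish a timestamped message and poll it straight back, in one thread."""
    rtts = []
    for _ in range(iterations):
        send_time = time.perf_counter_ns()
        # 8-byte timestamp padded out to the 32-byte message size.
        channel.append(send_time.to_bytes(8, "little").ljust(MSG_SIZE, b"\0"))
        msg = channel.popleft()
        recv_time = time.perf_counter_ns()
        assert int.from_bytes(msg[:8], "little") == send_time
        rtts.append(recv_time - send_time)
    return rtts


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an ascending list."""
    rank = max(1, math.ceil(len(sorted_samples) * pct / 100))
    return sorted_samples[min(rank, len(sorted_samples)) - 1]


def measure():
    """Warm up, start from a fresh channel, then report selected percentiles."""
    run_ping_pong(deque(), WARMUP)  # warm-up results are discarded
    rtts = sorted(run_ping_pong(deque(), ITERATIONS))
    return {p: percentile(rtts, p) for p in (50, 90, 99, 99.9, 99.99)}
```

Absolute numbers from this sketch say nothing about ZigBolt itself; it only shows the shape of the loop and why warm-up samples are kept out of the measured set.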
Throughput (IPC)
What: Measures the maximum sustained message publish rate through an IPC channel, with periodic polling to prevent buffer exhaustion.
Procedure:
- Create an IPC channel (/zigbolt_bench_tp) with 4 MB term length
- Record start timestamp
- Publish 10,000,000 messages of 64 bytes each
  - On publish failure (buffer full): poll 1,024 messages, retry
  - Every 10,000 publishes: poll up to 10,000 messages
- Record end timestamp
- Compute: msg/sec = count / elapsed, MB/sec = msg/sec * msg_size / 1MB
Configuration:
- Message size: 64 bytes
- Term length: 4 MB
- Message count: 10,000,000
- Pre-fault: enabled
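The two formulas in the Compute step are easy to check by hand. The Python sketch below is illustrative only; the function name `throughput` is hypothetical, and it assumes "1MB" means 2^20 bytes, which this page does not state explicitly.

```python
MIB = 1024 * 1024  # assumed meaning of "1MB" in the formula above


def throughput(count, elapsed_sec, msg_size):
    """Return (messages/sec, MB/sec) for `count` messages of `msg_size` bytes."""
    msg_per_sec = count / elapsed_sec
    mb_per_sec = msg_per_sec * msg_size / MIB
    return msg_per_sec, mb_per_sec


# 10,000,000 messages of 64 bytes in 0.2 s sits exactly on a 50M msg/sec rate.
rate, bandwidth = throughput(10_000_000, 0.2, 64)
print(f"{rate / 1e6:.1f} M/sec, {bandwidth:.1f} MB/sec")
```

Put differently, a run of this configuration reaches 50 million messages/second only if it finishes in under 0.2 seconds. At that rate the payload bandwidth is 3.2 × 10^9 bytes per second, or about 3,052 MB/sec under the 2^20-byte assumption.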
Target:
- > 50 million messages/second
UDP RTT (Loopback)
What: Measures UDP round-trip latency over the loopback interface. Sends a datagram from one socket and receives it on another, both bound to localhost.
Procedure:
- Create sender UDP channel (port 44445, non-blocking)
- Create receiver UDP channel (port 44444, non-blocking)
- Warm up with 5,000 messages (discarded)
- Drain any remaining datagrams
- For each of 50,000 measurement iterations:
  - Record send_time, embed in 32-byte message
  - Send via sender socket to receiver's port
  - Busy-poll receiver socket (up to 10,000 attempts)
  - Record recv_time, compute RTT
  - Add to HDR histogram
- Report percentiles
Configuration:
- Message size: 32 bytes
- Ports: 44444 (receiver), 44445 (sender)
- Warmup: 5,000 messages
- Measurement: 50,000 messages
- Non-blocking: enabled
Target:
- p50 < 5 us (an io_uring backend is planned and expected to lower this on Linux)
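The send and busy-poll steps can be modelled with two non-blocking sockets on localhost. This is a hedged Python sketch using the standard `socket` module, not ZigBolt's UDP channel. The helper names are hypothetical, it binds to OS-assigned ports (port 0) rather than 44444/44445 to avoid clashes, and it omits the warm-up and drain steps for brevity.

```python
import socket
import time

MSG_SIZE = 32
MAX_POLL_ATTEMPTS = 10_000


def open_udp(port=0):
    """Non-blocking UDP socket bound to localhost; port 0 picks a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    sock.setblocking(False)
    return sock


def busy_poll(sock):
    """Spin on recv until a datagram arrives or the attempt budget runs out."""
    for _ in range(MAX_POLL_ATTEMPTS):
        try:
            return sock.recv(MSG_SIZE)
        except BlockingIOError:
            continue
    return None


def measure_udp_rtt(iterations):
    """Send timestamped datagrams sender -> receiver and collect latencies in ns."""
    sender, receiver = open_udp(), open_udp()
    dest = receiver.getsockname()
    rtts = []
    try:
        for _ in range(iterations):
            send_time = time.perf_counter_ns()
            payload = send_time.to_bytes(8, "little").ljust(MSG_SIZE, b"\0")
            sender.sendto(payload, dest)
            if busy_poll(receiver) is not None:
                rtts.append(time.perf_counter_ns() - send_time)
    finally:
        sender.close()
        receiver.close()
    return rtts
```

The bounded poll matters because UDP gives no delivery guarantee: a lost datagram ends that iteration without a sample instead of hanging the benchmark.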
WireCodec Throughput
What: Measures the raw encode/decode throughput of the comptime WireCodec for TickMessage (32B) and OrderMessage (48B), including both single-message and batch (64-message) modes.
Procedure:
- Warm up with 100,000 encode operations (discarded)
- For 10,000,000 iterations:
  - Encode a message with varying fields (prevents constant-folding)
  - Accumulate a sink byte to prevent dead-code elimination
- Repeat for decode with doNotOptimizeAway on the result
- Repeat for batch encode/decode (64 messages per batch)
- Report: ns/msg, M/sec, MB/sec bandwidth
Configuration:
- Message types: TickMessage (32B), OrderMessage (48B)
- Iterations: 10,000,000
- Batch size: 64 messages
- Anti-optimization: varying input fields + sink accumulator
Target:
- Encode: < 10 ns/msg (> 100M msg/sec)
- Decode: < 10 ns/msg (> 100M msg/sec)
- Batch encode: > 150M msg/sec
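The ns/msg and msg/sec figures in these targets are two views of the same quantity: rate = 10^9 / (ns per message). The short Python sketch below shows the conversion. The function name `codec_rates` is hypothetical, and the bandwidth line again assumes MB means 2^20 bytes.

```python
MIB = 1024 * 1024  # assumed meaning of "MB" in the benchmark output


def codec_rates(ns_per_msg, msg_size):
    """Convert a per-message cost into (messages/sec, MB/sec)."""
    msgs_per_sec = 1e9 / ns_per_msg
    return msgs_per_sec, msgs_per_sec * msg_size / MIB


for label, ns in (("target bound", 10.0), ("illustrative encode", 3.2)):
    rate, bandwidth = codec_rates(ns, 32)  # TickMessage is 32 bytes
    print(f"{label}: {rate / 1e6:.1f} M/sec, {bandwidth:.0f} MB/sec")
```

A 10 ns/msg cost corresponds to exactly 100M msg/sec, which is why the two forms of the target are equivalent. The batch target of 150M msg/sec corresponds to an amortized cost of roughly 6.7 ns per message.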
SPSC Ring Buffer Latency
What: Measures the single-producer single-consumer ring buffer write/read round-trip latency across multiple message sizes.
Procedure:
- Initialize a 64K-entry SPSC ring buffer
- Warm up with 10,000 write/read pairs (discarded)
- For 100,000 measurement samples:
  - Batch 100 write/read pairs
  - Record per-operation average in HDR histogram
- Report percentiles for each message size
Configuration:
- Ring capacity: 65,536 entries
- Message sizes: 8B, 32B, 64B, 256B
- Warmup: 10,000 ops
- Samples: 100,000 (x100 batch = 10M ops)
Target:
- p50 < 50 ns (8B-64B messages)
- p99 < 200 ns
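Batching exists because a single ring-buffer operation can cost about as much as reading the clock, so timing 100 pairs and dividing gives a steadier per-operation figure. The Python sketch below illustrates the sampling pattern only. A `deque` stands in for the SPSC ring, and `sample_batched` is a hypothetical name.

```python
import time
from collections import deque

BATCH = 100


def sample_batched(ring, payload, samples):
    """Time BATCH write/read pairs per sample; return per-operation averages in ns."""
    per_op_ns = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        for _ in range(BATCH):
            ring.append(payload)  # write
            ring.popleft()        # read
        elapsed = time.perf_counter_ns() - start
        per_op_ns.append(elapsed / BATCH)
    return per_op_ns
```

With 100,000 samples this performs 100,000 × 100 = 10,000,000 operations, matching the configuration above. One consequence worth keeping in mind is that each recorded value is a mean over 100 operations, so the reported percentiles describe batch averages and a single slow operation inside a batch is smoothed out.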
MPSC Ring Buffer Latency
What: Measures the multi-producer single-consumer ring buffer latency under contention from multiple writer threads.
Configuration:
- Multiple producer threads writing concurrently
- Single consumer thread reading
- Measures contention overhead vs SPSC baseline
Target:
- p50 < 100 ns (under moderate contention)
- p99 < 500 ns
LogBuffer Throughput
What: Measures the LogBuffer claim/commit/read cycle latency, which is the foundation of the Aeron-style term buffer used by IPC channels.
Procedure:
- Initialize a LogBuffer with 64K term length
- Warm up with 10,000 claim/commit/read cycles
- Reset the buffer
- For 50,000 measurement samples:
  - Batch 100 claim/commit/read cycles
  - On claim failure: drain 4,096 messages and retry
  - Record per-operation average in HDR histogram
- Report percentiles for each message size
Configuration:
- Term length: 65,536 bytes
- Message sizes: 32B, 64B, 256B
- Warmup: 10,000 ops
- Samples: 50,000 (x100 batch = 5M ops)
Target:
- p50 < 100 ns
- p99 < 500 ns
IPC Multi-Size
What: Measures IPC channel latency across different message sizes to characterize how payload size affects publish/poll performance.
Configuration:
- Message sizes: 64B, 256B, 1024B
- Term length: 4 MB
- Pre-fault: enabled
Target:
- 64B: p50 < 200 ns
- 1024B: p50 < 500 ns
Results Format
All latency benchmarks output HDR histogram percentiles. The sample outputs below are illustrative of the format only; they are not measured claims:

    === Results ===
    Total samples: 100000
    Min: 45 ns
    Mean: 132.7 ns
    p50: 120 ns
    p90: 180 ns
    p99: 450 ns
    p99.9: 1200 ns
    p99.99: 3500 ns
    Max: 15000 ns

    [PASS] p50 = 120 ns (target: <200 ns)
    [PASS] p99 = 450 ns (target: <1000 ns)

Throughput benchmark output:

    === Throughput Results ===
    Published: 10000000 msgs
    Elapsed: 0.150 sec
    Throughput: 66.7 M/sec
    Bandwidth: 4053.3 MB/sec

    [PASS] > 50M msg/sec target met!

WireCodec benchmark output:

    === ZigBolt WireCodec Throughput Benchmark ===
    Iterations: 10000000
    Batch size: 64

    [TickMessage (32B)]
    Encode: 3.2 ns/msg (312 M/sec)
    Decode: 2.8 ns/msg (357 M/sec)
    Batch encode: 450 M/sec
    Batch decode: 420 M/sec
    Bandwidth: 9536 MB/sec (encode)
    [PASS] encode < 10 ns/msg

Full suite (bench_run_all) summary output (illustrative; numbers vary by machine):

    ╔═══════════════════════════════════════════════════════════════════════════════╗
    ║                               Benchmark Summary                               ║
    ╠════════════════╦═══════╦═════════╦═════════╦═════════╦═════════╦══════════════╣
    ║ Transport      ║ Size  ║ p50     ║ p99     ║ p99.9   ║ Max     ║ Throughput   ║
    ╠════════════════╬═══════╬═════════╬═════════╬═════════╬═════════╬══════════════╣
    ║ SPSC           ║ 8B    ║ 12 ns   ║ 45 ns   ║ 120 ns  ║ 500 ns  ║ 83.3 M/s     ║
    ║ SPSC           ║ 32B   ║ 15 ns   ║ 50 ns   ║ 150 ns  ║ 600 ns  ║ 66.7 M/s     ║
    ║ IPC            ║ 64B   ║ 120 ns  ║ 350 ns  ║ 900 ns  ║ 3000 ns ║ 8.3 M/s      ║
    ║ Codec-Enc      ║ 32B   ║ 3 ns    ║ 0 ns    ║ 0 ns    ║ 0 ns    ║ 333.3 M/s    ║
    ║ Codec-Dec      ║ 32B   ║ 2 ns    ║ 0 ns    ║ 0 ns    ║ 0 ns    ║ 500.0 M/s    ║
    ║ LogBuffer      ║ 64B   ║ 35 ns   ║ 120 ns  ║ 300 ns  ║ 1500 ns ║ 28.6 M/s     ║
    ╚════════════════╩═══════╩═════════╩═════════╩═════════╩═════════╩══════════════╝

The full suite also writes bench/results.json with structured data for CI integration and automated regression detection.
Performance Targets
The numbers below are design targets, not measured results. The only
measured data shipped with the repository is bench/results.json (one local
run of the suite, covering SPSC, IPC, LogBuffer, and codec rows). That file
contains no ping-pong RTT or UDP RTT rows, and its codec-encode row is
degenerate (sub-nanosecond timer resolution), so it does not substantiate
cross-platform latency comparisons. Run the suite on your own hardware for
real numbers.
| Benchmark | Metric | Target |
|---|---|---|
| IPC Ping-Pong | p50 RTT | < 200 ns |
| IPC Ping-Pong | p99 RTT | < 1,000 ns |
| IPC Throughput | msg/sec | > 50M |
| IPC Throughput | bandwidth | > 3 GB/s |
| UDP RTT | p50 | < 5 us |
| WireCodec Encode | ns/msg | < 10 ns |
| WireCodec Decode | ns/msg | < 10 ns |
| WireCodec Batch | msg/sec | > 150M |
| SPSC Ring | p50 | < 50 ns |
| SPSC Ring | p99 | < 200 ns |
| MPSC Ring | p50 | < 100 ns |
| LogBuffer | p50 | < 100 ns |
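Each latency target in the table is a strict upper bound, so checking a run against it is a single comparison. The Python sketch below mirrors the PASS line format shown under Results Format. The function name `check_target` and the `measured` values are hypothetical and purely illustrative, not measurements.

```python
def check_target(name, value_ns, limit_ns):
    """Format a PASS/FAIL line for a latency target of the form value < limit."""
    status = "PASS" if value_ns < limit_ns else "FAIL"
    return f"[{status}] {name} = {value_ns} ns (target: <{limit_ns} ns)"


TARGETS_NS = {"p50": 200, "p99": 1000}  # IPC ping-pong targets from the table
measured = {"p50": 120, "p99": 450}     # illustrative values, not measurements

for name, limit in TARGETS_NS.items():
    print(check_target(name, measured[name], limit))
```

Because the comparison is strict, a value exactly equal to the limit is reported as a failure in this sketch.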
Performance varies by:
- CPU architecture and cache hierarchy
- OS kernel version and scheduler configuration
- NUMA topology (for multi-socket systems)
- Core isolation (isolcpus, nohz_full) on Linux
- Background system load
Tuning for Best Results
Linux:

    # Isolate CPU cores for benchmarks
    sudo grubby --update-kernel=ALL --args="isolcpus=2,3 nohz_full=2,3"

    # Pin benchmark to isolated core
    taskset -c 2 ./zig-out/bin/bench_ping_pong

    # Disable frequency scaling
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    # Increase socket buffer sizes
    sudo sysctl -w net.core.rmem_max=16777216
    sudo sysctl -w net.core.wmem_max=16777216

macOS:

    # Ensure Xcode command-line tools are installed
    xcode-select --install

    # Disable Spotlight indexing on benchmark paths
    sudo mdutil -i off /tmp

    # Close unnecessary applications to reduce noise

HDR Histogram
The benchmarks use a custom lightweight HDR (High Dynamic Range) histogram implementation in bench/hdr_histogram.zig. It provides:
- Constant memory footprint (bucket array)
- O(1) recording
- Accurate percentile computation
- No allocations during measurement
This avoids measurement perturbation that would occur with a heap-allocating histogram.
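To make those properties concrete, here is a minimal Python sketch of one common way to build such a histogram: log-linear buckets, with a fixed number of sub-buckets per power of two. It is not a port of bench/hdr_histogram.zig. The bucket layout, the constants `SUB_BITS` and `MAX_BITS`, and the class name `Histogram` are all assumptions chosen for illustration.

```python
import math

SUB_BITS = 5                       # 32 sub-buckets per power of two (assumed)
SUB_COUNT = 1 << SUB_BITS
MAX_BITS = 40                      # values are clamped to just under 2**40 ns
NUM_BUCKETS = (MAX_BITS - SUB_BITS + 1) * SUB_COUNT


class Histogram:
    def __init__(self):
        self.counts = [0] * NUM_BUCKETS  # allocated once, never resized
        self.total = 0

    @staticmethod
    def _index(value):
        """Map a non-negative integer to its bucket using the top SUB_BITS + 1 bits."""
        value = min(max(value, 0), (1 << MAX_BITS) - 1)
        exp = max(0, value.bit_length() - SUB_BITS - 1)
        return exp * SUB_COUNT + (value >> exp)

    @staticmethod
    def _lower_bound(index):
        """Smallest value that maps to bucket `index`."""
        exp = index // SUB_COUNT - 1
        if exp < 0:
            return index
        return (index % SUB_COUNT + SUB_COUNT) << exp

    def record(self, value):
        """O(1): one index computation and one increment, no allocation."""
        self.counts[self._index(value)] += 1
        self.total += 1

    def percentile(self, pct):
        """Lower bound of the bucket holding the nearest-rank sample."""
        rank = max(1, math.ceil(self.total * pct / 100))
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return self._lower_bound(index)
        return 0
```

With these constants the bucket array has 1,152 entries regardless of how many samples are recorded. Values below 64 are stored exactly, and larger values are reported with a relative error of at most 1/32, which is the usual trade-off of this layout.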