Benchmarks

This document describes the benchmark methodology, how to run each benchmark, and the target performance numbers.

Overview

ZigBolt ships a comprehensive benchmark suite covering latency, throughput, codec performance, and data structure operations:

Benchmark	Binary	What it Measures
Ping-Pong	`bench_ping_pong`	IPC round-trip latency (RTT)
Throughput	`bench_throughput`	IPC single-direction message rate
UDP RTT	`bench_udp_rtt`	UDP loopback round-trip latency
Codec Throughput	`bench_codec_throughput`	WireCodec encode/decode rate (single + batch)
SPSC Latency	`bench_spsc_latency`	SPSC ring buffer write/read latency
MPSC Latency	`bench_mpsc_latency`	MPSC ring buffer contention latency
LogBuffer Throughput	`bench_logbuffer`	LogBuffer claim/commit/read rate
IPC Multi-Size	`bench_ipc_multisize`	IPC latency across message sizes
Full Suite	`bench_run_all`	All-in-one suite with JSON output

All benchmarks are compiled with -OReleaseFast for maximum optimization.

Building

zig build bench

This builds and runs the full suite (bench/run_all.zig). A plain zig build compiles all individual benchmark binaries into zig-out/bin/.

To run individually:

zig build && ./zig-out/bin/bench_ping_pong
zig build && ./zig-out/bin/bench_throughput
zig build && ./zig-out/bin/bench_udp_rtt
zig build && ./zig-out/bin/bench_codec_throughput
zig build && ./zig-out/bin/bench_spsc_latency
zig build && ./zig-out/bin/bench_mpsc_latency
zig build && ./zig-out/bin/bench_logbuffer
zig build && ./zig-out/bin/bench_ipc_multisize
zig build && ./zig-out/bin/bench_run_all

The bench_run_all binary runs all benchmarks in sequence and outputs a summary table plus a bench/results.json file for CI integration.

Methodology

Ping-Pong (IPC RTT)

What: Measures the time between publishing a message into an IPC channel and immediately polling it back in the same process. This captures the raw shared memory write/read latency without cross-process scheduling overhead.

Procedure:

Create an IPC channel (/zigbolt_bench_pp) with 1 MB term length, pre-faulted pages
Warm up with 10,000 messages (discarded)
Recreate the channel to start clean
For each of 100,000 measurement iterations:
- Record send_time via timestampNs()
- Publish a 32-byte message containing the timestamp
- Poll it back immediately
- Record recv_time, compute rtt = recv_time - send_time
- Add RTT to HDR histogram
Report percentiles: min, mean, p50, p90, p99, p99.9, p99.99, max

Configuration:

Message size: 32 bytes
Term length: 1 MB (1,048,576 bytes)
Warmup: 10,000 messages
Measurement: 100,000 messages
Pre-fault: enabled

Target:

p50 < 200 ns
p99 < 1,000 ns

Throughput (IPC)

What: Measures the maximum sustained message publish rate through an IPC channel, with periodic polling to prevent buffer exhaustion.

Procedure:

Create an IPC channel (/zigbolt_bench_tp) with 4 MB term length
Record start timestamp
Publish 10,000,000 messages of 64 bytes each
- On publish failure (buffer full): poll 1,024 messages, retry
- Every 10,000 publishes: poll up to 10,000 messages
Record end timestamp
Compute: msg/sec = count / elapsed, MB/sec = msg/sec * msg_size / 1MB

Configuration:

Message size: 64 bytes
Term length: 4 MB
Message count: 10,000,000
Pre-fault: enabled

Target:

> 50 million messages/second

UDP RTT (Loopback)

What: Measures UDP round-trip latency over the loopback interface. Sends a datagram from one socket and receives it on another, both bound to localhost.

Procedure:

Create sender UDP channel (port 44445, non-blocking)
Create receiver UDP channel (port 44444, non-blocking)
Warm up with 5,000 messages (discarded)
Drain any remaining datagrams
For each of 50,000 measurement iterations:
- Record send_time, embed in 32-byte message
- Send via sender socket to receiver’s port
- Busy-poll receiver socket (up to 10,000 attempts)
- Record recv_time, compute RTT
- Add to HDR histogram
Report percentiles

Configuration:

Message size: 32 bytes
Ports: 44444 (receiver), 44445 (sender)
Warmup: 5,000 messages
Measurement: 50,000 messages
Non-blocking: enabled

Target:

p50 < 5 us (an io_uring backend is planned and expected to lower this on Linux)

WireCodec Throughput

What: Measures the raw encode/decode throughput of the comptime WireCodec for TickMessage (32B) and OrderMessage (48B), including both single-message and batch (64-message) modes.

Procedure:

Warm up with 100,000 encode operations (discarded)
For 10,000,000 iterations:
- Encode a message with varying fields (prevents constant-folding)
- Accumulate a sink byte to prevent dead-code elimination
Repeat for decode with doNotOptimizeAway on the result
Repeat for batch encode/decode (64 messages per batch)
Report: ns/msg, M/sec, MB/sec bandwidth

Configuration:

Message types: TickMessage (32B), OrderMessage (48B)
Iterations: 10,000,000
Batch size: 64 messages
Anti-optimization: varying input fields + sink accumulator

Target:

Encode: < 10 ns/msg (> 100M msg/sec)
Decode: < 10 ns/msg (> 100M msg/sec)
Batch encode: > 150M msg/sec

SPSC Ring Buffer Latency

What: Measures the single-producer single-consumer ring buffer write/read round-trip latency across multiple message sizes.

Procedure:

Initialize a 64K-entry SPSC ring buffer
Warm up with 10,000 write/read pairs (discarded)
For 100,000 measurement samples:
- Batch 100 write/read pairs
- Record per-operation average in HDR histogram
Report percentiles for each message size

Configuration:

Ring capacity: 65,536 entries
Message sizes: 8B, 32B, 64B, 256B
Warmup: 10,000 ops
Samples: 100,000 (x100 batch = 10M ops)

Target:

p50 < 50 ns (8B-64B messages)
p99 < 200 ns

MPSC Ring Buffer Latency

What: Measures the multi-producer single-consumer ring buffer latency under contention from multiple writer threads.

Configuration:

Multiple producer threads writing concurrently
Single consumer thread reading
Measures contention overhead vs SPSC baseline

Target:

p50 < 100 ns (under moderate contention)
p99 < 500 ns

LogBuffer Throughput

What: Measures the LogBuffer claim/commit/read cycle latency, which is the foundation of the Aeron-style term buffer used by IPC channels.

Procedure:

Initialize a LogBuffer with 64K term length
Warm up with 10,000 claim/commit/read cycles
Reset the buffer
For 50,000 measurement samples:
- Batch 100 claim/commit/read cycles
- On claim failure: drain 4,096 messages and retry
- Record per-operation average in HDR histogram
Report percentiles for each message size

Configuration:

Term length: 65,536 bytes
Message sizes: 32B, 64B, 256B
Warmup: 10,000 ops
Samples: 50,000 (x100 batch = 5M ops)

Target:

p50 < 100 ns
p99 < 500 ns

IPC Multi-Size

What: Measures IPC channel latency across different message sizes to characterize how payload size affects publish/poll performance.

Configuration:

Message sizes: 64B, 256B, 1024B
Term length: 4 MB
Pre-fault: enabled

Target:

64B: p50 < 200 ns
1024B: p50 < 500 ns

Results Format

All latency benchmarks output HDR histogram percentiles. The sample outputs below are illustrative of the format only — they are not measured claims:

=== Results ===
  Total samples: 100000
  Min:     45 ns
  Mean:    132.7 ns
  p50:     120 ns
  p90:     180 ns
  p99:     450 ns
  p99.9:   1200 ns
  p99.99:  3500 ns
  Max:     15000 ns

  [PASS] p50 = 120 ns (target: <200 ns)
  [PASS] p99 = 450 ns (target: <1000 ns)

Throughput benchmark output:

=== Throughput Results ===
  Published:  10000000 msgs
  Elapsed:    0.150 sec
  Throughput: 66.7 M/sec
  Bandwidth:  4053.3 MB/sec

  [PASS] > 50M msg/sec target met!

WireCodec benchmark output:

=== ZigBolt WireCodec Throughput Benchmark ===
  Iterations:  10000000
  Batch size:  64

  [TickMessage (32B)]
    Encode:       3.2 ns/msg  (312 M/sec)
    Decode:       2.8 ns/msg  (357 M/sec)
    Batch encode: 450 M/sec
    Batch decode: 420 M/sec
    Bandwidth:    9536 MB/sec (encode)
    [PASS] encode < 10 ns/msg

Full suite (bench_run_all) summary output (illustrative — numbers vary by machine):

╔═══════════════════════════════════════════════════════════════════════════════╗
║                         Benchmark Summary                                    ║
╠════════════════╦═══════╦═════════╦═════════╦═════════╦═════════╦══════════════╣
║ Transport      ║  Size ║  p50    ║  p99    ║  p99.9  ║  Max    ║  Throughput  ║
╠════════════════╬═══════╬═════════╬═════════╬═════════╬═════════╬══════════════╣
║ SPSC           ║   8B  ║    12 ns ║    45 ns ║   120 ns ║   500 ns ║    83.3 M/s  ║
║ SPSC           ║  32B  ║    15 ns ║    50 ns ║   150 ns ║   600 ns ║    66.7 M/s  ║
║ IPC            ║  64B  ║   120 ns ║   350 ns ║   900 ns ║  3000 ns ║     8.3 M/s  ║
║ Codec-Enc      ║  32B  ║     3 ns ║     0 ns ║     0 ns ║     0 ns ║   333.3 M/s  ║
║ Codec-Dec      ║  32B  ║     2 ns ║     0 ns ║     0 ns ║     0 ns ║   500.0 M/s  ║
║ LogBuffer      ║  64B  ║    35 ns ║   120 ns ║   300 ns ║  1500 ns ║    28.6 M/s  ║
╚════════════════╩═══════╩═════════╩═════════╩═════════╩═════════╩══════════════╝

The full suite also writes bench/results.json with structured data for CI integration and automated regression detection.

Performance Targets

The numbers below are design targets, not measured results. The only measured data shipped with the repository is bench/results.json (one local run of the suite, covering SPSC, IPC, LogBuffer, and codec rows). That file contains no ping-pong RTT or UDP RTT rows, and its codec-encode row is degenerate (sub-nanosecond timer resolution), so it does not substantiate cross-platform latency comparisons. Run the suite on your own hardware for real numbers.

Benchmark	Metric	Target
IPC Ping-Pong	p50 RTT	< 200 ns
IPC Ping-Pong	p99 RTT	< 1,000 ns
IPC Throughput	msg/sec	> 50M
IPC Throughput	bandwidth	> 3 GB/s
UDP RTT	p50	< 5 us
WireCodec Encode	ns/msg	< 10 ns
WireCodec Decode	ns/msg	< 10 ns
WireCodec Batch	msg/sec	> 150M
SPSC Ring	p50	< 50 ns
SPSC Ring	p99	< 200 ns
MPSC Ring	p50	< 100 ns
LogBuffer	p50	< 100 ns

Performance varies by:

CPU architecture and cache hierarchy
OS kernel version and scheduler configuration
NUMA topology (for multi-socket systems)
Core isolation (isolcpus, nohz_full) on Linux
Background system load

Tuning for Best Results

Linux

# Isolate CPU cores for benchmarks
sudo grubby --update-kernel=ALL --args="isolcpus=2,3 nohz_full=2,3"

# Pin benchmark to isolated core
taskset -c 2 ./zig-out/bin/bench_ping_pong

# Disable frequency scaling
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Increase socket buffer sizes
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216

macOS

# Ensure Xcode command-line tools are installed
xcode-select --install

# Disable Spotlight indexing on benchmark paths
sudo mdutil -i off /tmp

# Close unnecessary applications to reduce noise

HDR Histogram

The benchmarks use a custom lightweight HDR (High Dynamic Range) histogram implementation in bench/hdr_histogram.zig. It provides:

Constant memory footprint (bucket array)
O(1) recording
Accurate percentile computation
No allocations during measurement

This avoids measurement perturbation that would occur with a heap-allocating histogram.