Designing Ultra-Low Latency Shared-Memory Queues for HFT in C++

Eliminate cache-coherence bottlenecks, false sharing, and memory-copy overheads in multi-consumer execution systems.

Jun 02, 2026


The Book is available now!

Begin with a concrete, minimal primitive: a bounded, non‑blocking, single‑writer / many‑reader ring buffer in shared memory used for local fan‑out. The goal is narrow and practical: deliver the same byte payload to multiple local consumers without entering the kernel, without per‑consumer coordination, and without letting one slow consumer block the producer. That constraint set—single writer, multiple readers, variable‑length messages, fixed capacity—drives every decision in the baseline.

What the primitive must provide

Deterministic publication semantics: every consumer should be able to observe every message the producer successfully publishes (until the message is overwritten by wrap‑around).
Non‑blocking producer: the writer never waits for readers; instead the system enforces a fixed capacity and relies on external handling (metrics, backpressure mechanisms, or drops) if consumers fall behind.
Simple, low‑overhead metadata: minimize atomic read‑modify‑write (RMW) operations and avoid false sharing between writer and readers.

Metadata layout: two per‑cache‑line counters

At its heart the header is two monotonic 64‑bit counters: write_pos and read_pos (both expressed as monotonically increasing byte counts). Place each on its own cache line, either with explicit padding or using alignas(cache_line), to avoid false sharing between cores that update or poll the counters. The semantic roles:

write_pos: the producer advances this to reserve space for a message. It represents the number of bytes the producer has claimed (reserved or published) since the queue was created.
read_pos (a.k.a. published_pos): the producer advances this only after the payload has been copied into the ring. Readers observe this counter to discover how many bytes are safely published and available for consumption.

Why byte counters instead of indexes: monotonic byte counters avoid frequent modulo arithmetic at every atomic update and make wrap detection straightforward. Ring positions are computed as write_pos % capacity when necessary.
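As a minimal sketch of the layout just described, assuming a 64-byte cache line (typical on x86-64, but platform-specific) and hypothetical names QueueCounters, kCacheLine, and ring_index, the two counters and the position-to-offset mapping might look like this:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Adjust for the target platform.
constexpr std::size_t kCacheLine = 64;

// Hypothetical header fragment: each counter owns a full cache line so the
// producer's updates and the readers' polls do not falsely share a line.
struct QueueCounters {
    // Bytes the producer has claimed (reserved or published) since creation.
    alignas(kCacheLine) std::atomic<std::uint64_t> write_pos{0};
    // Bytes fully copied into the ring and visible to readers.
    alignas(kCacheLine) std::atomic<std::uint64_t> read_pos{0};
};

static_assert(sizeof(QueueCounters) == 2 * kCacheLine,
              "each counter should occupy exactly one cache line");
static_assert(alignof(QueueCounters) == kCacheLine,
              "the struct itself must start on a cache-line boundary");

// Monotonic byte position -> offset inside the ring. The counters themselves
// never wrap in practice; only this derived index does.
constexpr std::uint64_t ring_index(std::uint64_t pos, std::uint64_t capacity) {
    return pos % capacity;
}
```

If the capacity is restricted to a power of two, the modulo can be replaced by a mask, which is a common choice but not something the baseline requires.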

Writer sequence (high level)

Compute message footprint: size prefix + payload length (size prefix allows variable‑length messages).

Atomically advance write_pos by the footprint to reserve a contiguous window. Together with the publish step below, this is one of only two atomic updates the writer makes to shared metadata per message.

Copy the size prefix and payload into the reserved region within the ring, using the ring index derived from the reserved start offset (modulo capacity). Handle potential crossing of the ring boundary—production code must either split the write or place a sentinel—omitted in the conceptual baseline for clarity.

Atomically advance read_pos by the footprint to publish the message, using release ordering so the copied bytes become visible no later than the new counter value. Readers treat only bytes below read_pos as visible.
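The four writer steps can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: DemoQueue and demo_write are hypothetical names, a std::vector stands in for the shared mapping, the size prefix is a 32-bit length, and a message that would cross the ring end is simply refused because boundary handling is omitted in the baseline.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-process stand-in for the shared-memory region.
struct DemoQueue {
    explicit DemoQueue(std::uint64_t capacity) : ring(capacity) {}
    alignas(64) std::atomic<std::uint64_t> write_pos{0};  // bytes reserved
    alignas(64) std::atomic<std::uint64_t> read_pos{0};   // bytes published
    std::vector<unsigned char> ring;
};

// Single-writer only. Returns false when the message would cross the ring
// end; a production queue would split the copy or insert padding instead.
inline bool demo_write(DemoQueue& q, const void* payload, std::uint32_t len) {
    const std::uint64_t cap = q.ring.size();
    const std::uint64_t footprint = sizeof(len) + len;               // step 1
    const std::uint64_t next = q.write_pos.load(std::memory_order_relaxed);
    if (next % cap + footprint > cap) return false;

    const std::uint64_t start =
        q.write_pos.fetch_add(footprint, std::memory_order_relaxed); // step 2
    unsigned char* dst = q.ring.data() + start % cap;
    std::memcpy(dst, &len, sizeof(len));                             // step 3
    std::memcpy(dst + sizeof(len), payload, len);
    // Release ordering: the copied bytes happen-before any reader that
    // acquire-loads read_pos and observes the new value.
    q.read_pos.fetch_add(footprint, std::memory_order_release);      // step 4
    return true;
}
```

Note that the sketch never checks reader progress: consistent with the fan-out contract, old bytes are overwritten once the ring wraps.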

Readers’ responsibilities and light synchronization

Readers never modify shared metadata in the baseline. They sample read_pos (an atomic load) to learn how many bytes are published.
If there are published bytes, a reader reads the size prefix at the appropriate offset (using its own observed consumer cursor to avoid overlapping reads) and determines the payload length.
The reader copies the payload into its local buffer and may re‑read read_pos (or the size prefix) to validate that the payload was fully published while the copy occurred. If the read validation fails (i.e., the writer published fewer bytes or a concurrent wrap invalidated the window), the reader treats the situation per policy (typically abort and retry or fail‑fast in demo code).
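A reader following these responsibilities might look like the sketch below. The names DemoRing and demo_read are hypothetical, and two choices are assumptions rather than anything stated above: validation re-checks write_pos (not read_pos), because reserved-but-unpublished bytes may already be overwriting the window, and a failed validation returns -1 without advancing the cursor. The payload memcpy remains a formal data race under the C++ memory model if the writer laps the reader during the copy.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-process stand-in for the shared-memory region.
struct DemoRing {
    explicit DemoRing(std::uint64_t capacity) : ring(capacity) {}
    alignas(64) std::atomic<std::uint64_t> write_pos{0};
    alignas(64) std::atomic<std::uint64_t> read_pos{0};
    std::vector<unsigned char> ring;
};

// Returns the payload length, 0 when nothing new is published, or -1 when
// the message was lost (reader lapped) or does not fit in `out`.
// `cursor` is private to each reader; shared metadata is never modified.
inline std::int64_t demo_read(const DemoRing& q, std::uint64_t& cursor,
                              void* out, std::size_t out_cap) {
    const std::uint64_t cap = q.ring.size();
    const std::uint64_t published = q.read_pos.load(std::memory_order_acquire);
    if (published == cursor) return 0;
    // Already lapped before we start: the window may be partly overwritten.
    if (q.write_pos.load(std::memory_order_relaxed) - cursor > cap) return -1;

    std::uint32_t len = 0;
    const unsigned char* src = q.ring.data() + cursor % cap;
    std::memcpy(&len, src, sizeof(len));  // size prefix (no boundary handling)
    if (len > out_cap || sizeof(len) + len > published - cursor) return -1;
    std::memcpy(out, src + sizeof(len), len);

    // Validate after the copy: if the writer has reserved more than one full
    // ring past our cursor, the bytes we copied cannot be trusted.
    std::atomic_thread_fence(std::memory_order_acquire);
    if (q.write_pos.load(std::memory_order_relaxed) - cursor > cap) return -1;

    cursor += sizeof(len) + len;
    return len;
}
```

Because 0 doubles as the "empty" result, this shape cannot distinguish an empty queue from a zero-length message; a status enum would remove that ambiguity.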

Fan‑out semantics, not load balancing

This primitive is for fan‑out: every consumer gets the same logical stream. The writer is permitted to overwrite old data when the monotonic counters indicate the ring capacity is exhausted; the system assumes either consumers keep up or the application handles missed messages. This choice deliberately isolates a slow or crashed consumer from stalling the producer and other consumers.

Variable‑length messages and copy‑in‑queue policy

Messages are size‑prefixed and stored as contiguous bytes in the ring. The API shape is intentionally simple: write(bytes) on the producer side and read(buffer) on the reader side (returning zero or a status when empty). Copying payloads into the ring, rather than transmitting pointers, preserves process isolation, simplifies lifetime reasoning across process boundaries, and avoids remote‑heap pointer invalidity when consumers are separate processes.

Correctness caveats and practical omissions

This section presents a conceptual minimum; several correctness details are intentionally elided from the demo to keep the core idea crystal clear.

Wrap‑around: production code must correctly handle reservations that cross the ring end. Strategies include splitting writes into two segments, inserting padding to align the next reservation, or refusing a reservation that doesn't fit and wrapping the write pointer explicitly.
Language‑level data races: the baseline copies payload bytes without per‑byte atomic operations. Under C/C++ memory models this can create undefined behavior if a reader copies bytes concurrently with a writer writing them. The demo treats such races as fatal (fail‑fast) for simplicity, but a production system must either ensure publication ordering (via memory fences and atomic stores for critical fields), adopt atomic memcpy abstractions, or accept the practical safety handwaving when both sides are cooperative and run on the same host.
SEQLOCK and other optimistic patterns: sequence‑lock‑style publication looks attractive but is tricky to implement portably and efficiently in C/C++. Sequence counters require careful memory‑ordering discipline and still may force expensive RMWs; they do not magically eliminate validation costs.
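Of the wrap-around strategies listed above, splitting the copy into two segments is the simplest to illustrate. The helpers below are a sketch with hypothetical names (ring_copy_in, ring_copy_out); they assume len never exceeds the capacity and address only the boundary crossing, not the data-race or publication-ordering caveats.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy `len` bytes into the ring starting at monotonic position `pos`,
// splitting into two memcpy calls when the span crosses the ring end.
// Precondition (assumed): len <= capacity.
inline void ring_copy_in(unsigned char* ring, std::uint64_t capacity,
                         std::uint64_t pos, const void* src, std::size_t len) {
    const std::size_t idx = static_cast<std::size_t>(pos % capacity);
    const std::size_t first =
        std::min(len, static_cast<std::size_t>(capacity) - idx);
    std::memcpy(ring + idx, src, first);                        // up to ring end
    std::memcpy(ring, static_cast<const unsigned char*>(src) + first,
                len - first);                                   // wrapped tail
}

// Mirror image for readers: gather a possibly wrapped span into `dst`.
inline void ring_copy_out(const unsigned char* ring, std::uint64_t capacity,
                          std::uint64_t pos, void* dst, std::size_t len) {
    const std::size_t idx = static_cast<std::size_t>(pos % capacity);
    const std::size_t first =
        std::min(len, static_cast<std::size_t>(capacity) - idx);
    std::memcpy(dst, ring + idx, first);
    std::memcpy(static_cast<unsigned char*>(dst) + first, ring, len - first);
}
```

Splitting keeps the ring fully packed at the cost of a branch-free second memcpy per message; padding to the ring end trades some capacity for messages that are always contiguous, which matters more once zero-copy access is on the table.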

Practical header and shared‑memory setup

The queue lives in a shared memory mapping: a small protocol header (magic/version/parameters) followed by the ring buffer and the cache‑line aligned counters. Create and map the shared region once at process start (POSIX shared‑memory or equivalent) so all participants use the same offsets and capacity. Keep the header placement stable to avoid ABI mismatches between producer and consumers.
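One way to guard against ABI mismatches is for every participant to validate the protocol header right after mapping the region. The sketch below shows only that check; the field set, the kMagic and kVersion values, and the function name header_compatible are all hypothetical, and the actual mapping call (for example shm_open plus mmap on POSIX) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical protocol header placed at offset 0 of the mapping.
struct ProtocolHeader {
    std::uint64_t magic;     // identifies the region as one of our queues
    std::uint32_t version;   // bumped on any layout change
    std::uint32_t reserved;  // explicit padding keeps the layout unambiguous
    std::uint64_t capacity;  // ring size in bytes
};
static_assert(sizeof(ProtocolHeader) == 24, "unexpected implicit padding");
static_assert(std::is_trivially_copyable<ProtocolHeader>::value,
              "header must be safe to memcpy across processes");

constexpr std::uint64_t kMagic = 0x51554555454D4150ULL;  // illustrative value
constexpr std::uint32_t kVersion = 1;                    // illustrative value

// Returns true only if the mapped bytes start with a header this build
// understands and the capacity matches what the caller was configured with.
inline bool header_compatible(const void* mapping, std::size_t mapping_bytes,
                              std::uint64_t expected_capacity) {
    if (mapping == nullptr || mapping_bytes < sizeof(ProtocolHeader)) return false;
    ProtocolHeader h;
    std::memcpy(&h, mapping, sizeof(h));  // avoid aliasing the raw mapping
    return h.magic == kMagic && h.version == kVersion &&
           h.capacity == expected_capacity;
}
```

A consumer that fails this check should refuse to attach rather than guess at offsets.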

Tradeoffs to keep in mind

Copying payloads increases memory traffic but simplifies correctness across process boundaries.
A fixed bounded ring enforces non‑blocking producer behavior; dynamic resizing trades simplicity and speed for flexibility.
The single‑writer assumption drastically reduces contention; supporting multiple writers requires different reservation schemes and comes with a measurable throughput cost.

Takeaway

The minimal, high‑performance fan‑out queue is defined by two per‑cache‑line 64‑bit counters, a single writer that reserves and then publishes by advancing those counters, and multiple readers that poll published bytes and copy payloads out. This design yields a bounded, non‑blocking primitive suitable for local fan‑out, but it requires careful handling of wrap‑around, an awareness of language‑level data‑race pitfalls, and disciplined publication semantics if you intend to move from a conceptual demo to production. Subsequent sections show how to turn this baseline into a much faster, robust primitive: amortize atomics via bulk reservations, cache read positions to reduce metadata traffic, and expose zero‑copy writer APIs to double realistic throughput.

Figure 11-1. The baseline queue is a fixed-capacity ring plus two hot counters isolated on separate cache lines. The producer reserves byte ranges and publishes only after payload bytes are safe for readers.

Contention Model: Where Cycles Go in Multi-Reader Fan-Out

We begin from a deceptively small surface: two 64‑bit counters in the queue header — a producer pointer (write/produce) and a visibility pointer (read/publish) — each pinned to its own cache line. That simple layout makes the hot spot obvious in a single‑writer / many‑reader fan‑out: the producer advances position metadata and every consumer repeatedly reads the producer’s position to discover new messages. Under load, that single cache line becomes the dominant source of coherence traffic; understanding how that traffic maps to machine events tells you exactly what to optimize and how to measure success.

Cache-Line Contention in Multi-Reader Fan-Out

Figure 11-2. The shared write-position cache line becomes the contention point: producer updates and reader polls force ownership transfers, invalidations, and remote reads.

Where the cycles go

Writer actions: reserve (atomic RMW or atomic add) → copy payload bytes into the ring → publish visibility (another atomic). Each atomic RMW produces a cache‑line ownership transfer: the producer must take the write‑counter cache line into its core’s modified state, perform the update, and release it. If the producer reserves frequently, those transfers happen at roughly the message rate.
Reader actions: poll the producer’s position (atomic load) to check for new data; when positive, read the size prefix and payload bytes; re‑check the producer’s position. Each reader load can cause cache‑to‑cache transfers or remote reads, depending on topology and coherence state. With N readers, the producer’s single cache line gets repeatedly bounced between the producer and all readers.

This ping‑pong dominates latency and throughput: the producer’s updates and the readers’ polls induce frequent cache invalidations and remote transfers, which cost orders of magnitude more than a local load/store. The queue’s bounded, non‑blocking contract amplifies the problem: the producer never waits for slow readers, so we cannot shift the cost to blocking or locks — instead we must reduce the frequency of those coherence events.

Translate contention to concrete goals

To convert intuition into engineering targets, focus on two measurable metrics:

Producer atomic update frequency to the write counter (updates/sec). Goal: reduce RMWs per second.

Per‑reader atomic load frequency of the write counter (loads/sec per reader). Goal: reduce redundant loads when a reader is actively draining data it already knows is present.

Why these two? Because every atomic update and atomic load on that cache line maps to a cache‑line transfer or an expensive remote access. Reducing these two rates proportionally reduces coherence traffic, CPU stalls waiting for cache line ownership, and worst‑case latency.

Quantifying the impact: a simple model

Let:

S = average payload+overhead size in bytes
M = messages/sec produced
Rw = reservation window in bytes (size of a bulk reservation)

Baseline reservation atomic rate (no bulk reservation) ≈ M: one reserving atomic add per message, with the per‑message publish update adding a second. With bulk reservation of size Rw, the producer performs approximately M × S / Rw reserving atomics per second (one per reservation window). Put another way, the reservation atomic rate drops by roughly the factor Rw / S. Example: S = 100 B, Rw = 100 KiB (102,400 B) → reduction = 1024; one reserving atomic per ~1,000 messages.

For readers, suppose a reader drains an available window of A bytes before reloading the write counter. If the reader caches the most recent producer position and only reloads when its cached window is exhausted, its atomic load frequency is approximately M × S / A. Because this is fan‑out rather than load balancing, every reader drains the full stream, so that rate applies per reader and the aggregate across N readers is N × M × S / A. The important point: increasing A (the cached available window) reduces reader loads in inverse proportion.
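The model is small enough to encode directly, which makes it easy to compare predicted rates against measured ones. The function names below are hypothetical, and the message rate of 1,000,000 messages/sec used in the usage note is an illustrative figure, not a measurement.

```cpp
#include <cassert>

// Reserving atomics per second with a bulk window of Rw bytes:
// M messages/sec * S bytes/message, divided by bytes per window.
constexpr double writer_atomics_per_sec(double M, double S, double Rw) {
    return M * S / Rw;
}

// Shared-counter loads per second for one reader that drains the whole
// stream and reloads only after consuming A cached bytes.
constexpr double reader_loads_per_sec(double M, double S, double A) {
    return M * S / A;
}

// Factor by which bulk reservation cuts the per-message reservation rate.
constexpr double reduction_factor(double Rw, double S) {
    return Rw / S;
}
```

With S = 100 B and Rw = 102,400 B the reduction factor is 1024; at an assumed 1,000,000 messages/sec that is about 977 reserving atomics per second instead of one million.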

Key optimizations that follow directly from this model

Bulk reservation (producer): reserve Rw bytes with one atomic add and use that space to write many messages; publish visibility either per message or in batched publishes depending on correctness model. This amortizes writer atomics by Rw / S.
Reader‑side caching: after a successful atomic read that reports K bytes available, keep that K locally and serve subsequent reads from it without reloading the producer position until it’s exhausted. This eliminates redundant atomic loads while draining known availability.
Compact payload alignment: prefer 8‑byte alignment on x86 rather than cache‑line alignment. Tight packing shrinks the padded per‑message footprint S (more messages per reservation), improves prefetch and cache locality for payload scanning, and keeps metadata confined to dedicated cache lines.
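Reader-side caching and compact alignment are both a few lines of code. The sketch below uses hypothetical names (CachingReader, align_up8) and adds a shared_loads counter purely as instrumentation, so the reduction in shared-counter loads can be observed directly.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Round a message footprint up to the next multiple of 8 bytes.
constexpr std::uint64_t align_up8(std::uint64_t n) {
    return (n + 7) & ~std::uint64_t{7};
}

// Per-reader private state: touches the shared counter only when the
// locally cached window of known-published bytes has been fully drained.
struct CachingReader {
    std::uint64_t cursor = 0;            // next byte this reader will consume
    std::uint64_t cached_published = 0;  // last published position observed
    std::uint64_t shared_loads = 0;      // instrumentation only

    // Bytes known to be available, refreshing from `published` only if the
    // cached window is exhausted.
    std::uint64_t available(const std::atomic<std::uint64_t>& published) {
        if (cursor == cached_published) {
            cached_published = published.load(std::memory_order_acquire);
            ++shared_loads;
        }
        return cached_published - cursor;
    }

    // Advance past `n` bytes that the caller has finished consuming.
    void consume(std::uint64_t n) { cursor += n; }
};
```

The trade-off is freshness: while a reader works through a cached window it does not notice newer messages, which is harmless for throughput but means overrun detection still needs its own check against the producer position.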

Tradeoffs and correctness caveats

Freshness vs load: larger producer reservations mean that readers may not see newly produced bytes until the producer publishes visibility; readers will observe data less frequently if the producer withholds large segments. Tune Rw as a small fraction of ring capacity (1–2% is conservative for high‑throughput systems).
Wrap‑around and capacity: reservations must not overrun unread data. Implement wrap‑around checks and ensure reservation atomics consider current consumer positions. Production code must handle the ring boundary and reclaim space safely.
Language memory model: copying payload bytes non‑atomically while a writer may be concurrently writing them creates a data race under C/C++ rules. Practical implementations either:
Detect inconsistent reads and retry, accepting undefined‑behavior risk but ensuring protocol correctness in practice; or
Use atomic byte copies or stronger synchronization primitives at the cost of performance. Be explicit about the model you accept and document the tradeoffs.
Reader starvation: if Rw is too large relative to consumer lag and ring capacity, readers might observe sudden reductions in visible data, which can increase observed latency. Balance reservation size with operational constraints.

How to verify improvements

Measure before/after using hardware counters and end‑to‑end latency distributions:

Atomic update and atomic load rates (instrument your queue to count successful increments/loads).
Cache coherence events: cache‑line invalidations, remote cache transfers, and interconnect snoop rates (perf events like LLCMISS, MEMLOADUOPSLLCMISSCAUSESREMOTEDRAM or coherence counters depending on platform).
Latency P50/P99/P999 for both producer and consumers.

Expectations after the three optimizations: writer atomic add rate falls by roughly Rw / S; per‑reader atomic loads drop by the cached window factor; coherence transfer counters fall accordingly and latency tails tighten. If measurements disagree, inspect wrap‑around bugs, excessive reservation sizes, or language‑level races causing retries.

Summary checklist

Confirm producer and visibility counters are on separate cache lines.
Pick a reservation Rw from throughput and ring capacity heuristics (start small: 1–2% of ring).
Implement reader caching of the producer position and only refresh when local available bytes are exhausted.
Keep payload alignment compact (8‑byte on x86).
Instrument atomic rates and coherence counters; validate reductions against predicted factors.

Understanding the cache‑line ownership picture gives you a direct lever: reduce the number of times that hot cache line changes ownership. Bulk reservation and reader caching are precisely the levers that do this, and the math above gives you a means to size them and verify results.

Bulk Reservation: Amortizing Writer Atomics

The single-writer, many-reader queue makes one piece of metadata the focal point of contention: the producer's write position. Every produced message increments that counter; every reader polls it to know what bytes are available. Under high fan‑out this single 8‑byte counter jumps between cores so often that cache‑coherence traffic—not CPU work or memory copies—becomes the bottleneck. Bulk reservation converts that steady stream of tiny atomic RMWs into a much smaller number of larger RMWs, which reduces cache‑line ownership transfers by orders of magnitude and lets the producer write many messages without touching shared metadata.
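A minimal sketch of the idea, assuming a single writer and hypothetical names (BulkWriter, reserve, shared_rmws): the producer claims a window of Rw bytes with one fetch_add and then hands out message offsets from private state. Because there is only one writer, each new window begins exactly where the previous one ended, so leftover bytes carry over and no gaps appear in the byte stream. Publishing visibility to readers is a separate step and is not shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Single-writer bulk reservation: one shared RMW per `window` bytes.
struct BulkWriter {
    BulkWriter(std::atomic<std::uint64_t>& wp, std::uint64_t rw)
        : write_pos(wp),
          window(rw == 0 ? 1 : rw),  // guard against a zero-sized window
          local_next(wp.load(std::memory_order_relaxed)),
          local_end(local_next) {}

    // Returns the monotonic start offset for a message of `footprint` bytes.
    std::uint64_t reserve(std::uint64_t footprint) {
        // Extend the private window until the message fits. With a single
        // writer, fetch_add returns the current local_end, so windows are
        // contiguous and the unused tail of the old window is reused.
        while (local_end - local_next < footprint) {
            local_end =
                write_pos.fetch_add(window, std::memory_order_relaxed) + window;
            ++shared_rmws;
        }
        const std::uint64_t start = local_next;
        local_next += footprint;  // private bookkeeping, no shared traffic
        return start;
    }

    std::atomic<std::uint64_t>& write_pos;  // shared reservation counter
    std::uint64_t window;                   // Rw, bytes claimed per RMW
    std::uint64_t local_next;               // next unassigned byte (private)
    std::uint64_t local_end;                // end of claimed bytes (private)
    std::uint64_t shared_rmws = 0;          // instrumentation only
};
```

With a 1,024-byte window and 100-byte footprints, twenty reservations cost two shared RMWs instead of twenty; realistic windows are far larger, which is where the orders-of-magnitude reduction comes from.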

What bulk reservation does, conceptually
