How We Fixed a Fatal Latency Tail in Our Trading Engine
A step-by-step autopsy of how C++ memory layout assumptions failed under production load, and the microsecond-level fix that saved us.
You will be able to download the entire book as pdf in the next article. So subscribe
Upcoming book: When Nanoseconds Matter: Ultrafast Trading Systems in C++
Order Book Requirements and Update API: Constraints That Drive Layout
An order book is deceptively simple to describe and brutally unforgiving to implement for low-latency trading systems. Before you choose containers or tune memory layouts, be explicit about the invariants you must maintain, the operations the feed supplies, the shape of real traffic, and the operational boundaries that define acceptable latency behavior. Those constraints—not abstract big‑O alone—must drive your data structure decisions.
Book model and invariants
Two ordered sides. The in‑memory book consists of two independent, ordered sequences: bids sorted descending (best bid first) and asks sorted ascending (best ask first). This ordering is an invariant; consumers rely on it to obtain best prices without additional computation.
Price levels versus per‑order identity. Market messages reference individual orders (each with a 64‑bit id, price and volume), but most consumers care about price levels: the aggregate volume at a price and the rank of that price in the ordered side. The implementation therefore maintains both the notion of a level (price + aggregated volume) and the per‑order identity that lets you update or remove specific orders.
Deterministic semantics. Updates must be applied in receipt order (or in the order defined by the feed) to avoid transient inconsistencies. Correctness matters: stale or missed updates produce wrong best prices and can cause costly trading mistakes.
Figure 5-1. The canonical two-sided order book. Bids are maintained in descending price order (best bid first); asks in ascending order (best ask first). The "inside spread" between the best bid and best ask is the focal point for most trading activity. Consumers of the book rely on this strict ordering invariant to obtain the current best prices without additional computation.
Canonical feed API you must support Every exchange feed can be reduced to three primitives you must implement efficiently:
new(order_id, side, price, volume) — introduce a new order.
modify(orderid, deltaornewvolume) — change an existing order’s size (feeds vary; some send deltas, some absolute sizes).
delete(order_id) — remove an order.
Crucially, modify and delete messages typically reference only the order_id. They do not include price or side. That simple fact determines the need for an auxiliary index.
Why a separate id → metadata index is mandatory Because subsequent messages only carry an order identifier, the system must resolve that id back to the level and side where the order lives. The only practical way to do this efficiently is a direct mapping from order_id to the metadata required to apply an update: the side, the price (or a level identifier), and any per‑order bookkeeping (such as pointer into per‑level linked list or the current volume). Without this mapping you would have to search the ordered structure to find the order—unacceptable in a high‑rate feed.
This observation enforces a separation of concerns: an ordered container for level semantics and scanning, plus an auxiliary id→metadata index for constant‑time resolution of modify/delete messages. These two components can, and should, use different underlying containers tuned for their responsibilities.
Container tradeoffs: nodes vs contiguous storage Two natural choices for the ordered, level container demonstrate the tradeoffs you’ll face:
Node‑based containers (e.g., balanced trees)
Pros: logarithmic insertion/removal, stable iterators and node addresses across most operations, straightforward to reference from the id index by storing pointers or iterators.
Cons: poor cache locality. Each node is a separate allocation or small contiguous block; traversals and updates induce pointer chasing and scattered memory access patterns that produce long latency tails under load.



