Lock-Free Ring Buffer

What is a Lock-Free Ring Buffer

December 24, 2024

Sure — let’s break that down and go deep into what it means to build a lock-free ring buffer with memory fences and cache-aware data structures, especially in the context of high-performance inter-thread or inter-process messaging.

Lock-Free Ring Buffer

A ring buffer (or circular buffer) is a fixed-size array where a producer writes data to a position and a consumer reads from a position, both wrapping around the end of the buffer.

In a lock-free implementation:

No mutexes or blocking primitives (like std::mutex, std::condition_variable, etc.)
Coordination is done via atomic variables (e.g., sequence numbers) and memory ordering (using fences)
Producer and consumer operate independently without interfering with each other — assuming single-producer/single-consumer or proper synchronization in multi-threaded use

Minimal API

bool try_push(const T& item);
bool try_pop(T& out);

You usually maintain:

head (write position)
tail (read position)

Using a pattern like:

if ((head + 1) % size == tail) { /* buffer full */ }
buffer[head] = item;
head = (head + 1) % size;

Memory Fences (or Memory Barriers)

Modern CPUs reorder memory instructions for performance. A memory fence enforces ordering and visibility constraints across threads or cores.

Without fences, one CPU core may not see updates another has made — leading to stale reads or write reordering bugs.

Key Primitives in C++:

std::atomic_thread_fence(std::memory_order_release)
std::atomic_thread_fence(std::memory_order_acquire)
Or use std::atomic<T> with fine-grained memory orders (memory_order_relaxed, acquire, release, etc.)

Example

On write:

buffer[head] = item;
std::atomic_thread_fence(std::memory_order_release);
head_index.store(next_head, std::memory_order_relaxed);

On read:

if (head_index.load(std::memory_order_relaxed) == tail_index)
    return false;
std::atomic_thread_fence(std::memory_order_acquire);
item = buffer[tail];
tail_index++;

This guarantees producer writes are visible to the consumer before the consumer reads the data.

Cache-Aware Data Structures

Modern CPUs use multi-level caches (L1, L2, L3) and rely heavily on cache lines (usually 64 bytes). Performance can be destroyed by:

False sharing: multiple threads writing to different variables that sit on the same cache line
Cache thrashing: accessing memory in ways that invalidate cache frequently

Techniques

Padding / Alignment

To avoid false sharing:

struct alignas(64) PaddedAtomic {
    std::atomic<size_t> index;
    char padding[64 - sizeof(std::atomic<size_t>)];
};

Each critical field sits on a separate cache line.

Prefetching and Access Patterns

Use linear scans and predictable memory access so the CPU can prefetch data efficiently.

Preallocation

All memory is preallocated; no new or heap allocation during operation. This keeps memory locality tight and GC (if applicable) off your back.

Combining It All

Structure

+-----------------------------+
| Ring Buffer [N items]       |
| - Aligned data entries      |
| - Atomic indices (head/tail)|
+-----------------------------+
Producer Thread:
  - Writes to slot N
  - Issues release fence
  - Updates head

Consumer Thread:
  - Loads head
  - Issues acquire fence
  - Reads slot N

Thread-safe Communication

Single-producer/single-consumer is the simplest case.
For multiple producers or consumers, use:
- Per-thread write cursors
- CAS (compare-and-swap) for index updates
- More complex coordination patterns like barriers or tickets

Use Cases

Market data ingestion
Trading strategy pipelines
Real-time log or telemetry processors
Shared-memory IPC systems
Low-latency game engines or simulation loops

Resources & Inspirations

LMAX Disruptor (Java, core design)
Agrona (Java low-level lib behind Disruptor)
concurrentqueue (C++) – a fast multi-producer, multi-consumer queue
Facebook folly::ProducerConsumerQueue
Chronicle Queue (Java, memory-mapped queue)

Great question — wraparound is central to how ring buffers work and needs to be handled carefully to avoid overwriting unread data or reading uninitialized data. Let’s break it down.

What Is Wraparound?

A ring buffer is a fixed-size array that treats the end as connected to the beginning:

Index:  0  1  2  3  4  5  6  7
        ^                    ^
      tail                head

If the producer keeps writing, head will eventually reach the end and wrap to index 0. Same for the consumer and tail.

Problem: How Do You Know If the Buffer Is Full or Empty?

You track positions using two indexes (or sequence numbers):

head or write_index: where the producer will write next
tail or read_index: where the consumer will read next

But after wraparound, head == tail can mean:

Buffer is empty, or
Buffer is full (if indexes have wrapped)

Solution Strategies

1. Leave One Slot Empty

This is the simplest and most common strategy.

Buffer is considered full when:
next_head == tail
Buffer is empty when:
head == tail

next_head = (head + 1) % capacity;
if (next_head == tail) {
    // buffer is full
}

Pros: Simple logic, no extra tracking needed
Cons: You sacrifice one slot (max usage is capacity - 1)

2. Use a Counter (Sequence Numbers)

Instead of storing raw indexes, you use monotonically increasing counters:

head_seq and tail_seq are unbounded integers
Physical index = seq % capacity

This removes ambiguity:

Full: head_seq - tail_seq == capacity
Empty: head_seq == tail_seq

if (head_seq - tail_seq == capacity) {
    // full
}
index = head_seq % capacity;
buffer[index] = item;
head_seq++;

Pros: Full utilization of buffer, easy to add multiple producers/consumers
Cons: Slightly more complex logic and possible integer overflow (though 64-bit counters make this a non-issue)

3. Use a Flag or Status Byte

Each slot can store a state or version counter to indicate whether the slot is valid.

This is more common in advanced implementations like:

Multi-producer/multi-consumer
Aeron
LMAX Disruptor

Pros: Supports high concurrency, avoids ambiguity
Cons: More complex memory layout and synchronization logic

Wraparound Safety Summary

Strategy	Handles Wraparound	Max Capacity Usage	Simplicity	Suitable For
Leave one slot empty	Yes	`capacity - 1`	Very easy	Single producer/consumer
Monotonic counters	Yes	Full capacity	Medium	Multi-producer/consumer
Slot status/versions	Yes	Full capacity	Complex	High-performance, concurrent

State in the Ring Buffer

Good question — if you're using per-slot state or metadata in a ring buffer, you typically store something that allows the producer and consumer to independently and safely coordinate access to a slot, especially in high-concurrency or lock-free contexts.

Here are the most common types of state stored per slot:

1. Sequence Number / Version Counter

Each slot has a sequence number (not to be confused with the global head/tail counters). This is used in the Disruptor pattern and is extremely powerful.

Purpose:

Lets the consumer know whether a slot is ready to be read
Lets the producer know whether a slot is safe to overwrite

How It Works:

Initialize each slot's sequence to its index
Producer checks if slot's sequence == slot_index
After writing, producer sets sequence = slot_index + 1
Consumer reads only if sequence == slot_index + 1, and then resets it to slot_index + capacity after consuming

Example:

struct Slot {
    T data;
    std::atomic<int64_t> sequence;
};

Benefits:

Lock-free coordination between producer/consumer
No need for memory fences outside the ring buffer
Supports multiple consumers (with additional care)

2. Status Enum (e.g., EMPTY / FULL)

A simpler approach where each slot contains a status flag:

enum SlotStatus { EMPTY, FULL };

struct Slot {
    T data;
    std::atomic<SlotStatus> status;
};

Producer waits for status == EMPTY, then writes and sets status = FULL
Consumer waits for status == FULL, then reads and sets status = EMPTY

Drawbacks:

Not efficient for high-performance systems
Requires memory fences or relaxed atomics carefully
Doesn’t scale well for multiple producers/consumers

3. Generation / Epoch Counter

In some lock-free queues or memory pools, a generation or epoch is stored to detect:

ABA problems
Whether the slot was reused in a new “generation” after wraparound

struct Slot {
    T data;
    uint64_t epoch;
};

Used more in hazard-pointer or lock-free memory reclamation schemes than in vanilla ring buffers.

4. Owner Thread ID (rare)

In exotic setups, especially with multiple producers or consumers, you might store:

Thread ID
Logical "owner" of the slot
Or even a small state machine (claiming, committing, etc.)

But this is unusual and often overkill for ring buffers.

Summary Table

State Type	Used For	Pros	Cons
Sequence number	Coordination + wraparound	Fast, lock-free, scalable	Slightly complex logic
Status enum	Full/empty flags	Easy to understand	Risk of race conditions
Generation count	Memory reclamation safety	Useful with ABA protection	Overhead unless needed
Owner ID	Thread-specific management	Fine-grained control	Rarely needed

In Practice

If you're aiming for lock-free, high-performance, use sequence numbers per slot — that's what Disruptor, Aeron, and most production-grade ring buffers use. It gives you ordering, wraparound safety, and lets you run many producers and consumers at full speed.

Originally posted: December 24, 2024

Filed Under:

architecture