
I was trying to build an MPSC lock-free ring buffer for learning purposes, and I am running into race conditions.

A description of the MPSC ring buffer:

  • It is guaranteed that poll() is never called when the buffer is empty.
  • Instead of taking head and tail modulo the buffer size like a traditional ring buffer, it lets them advance linearly and ANDs them with a mask before using them (since the buffer size is a power of 2, this works correctly even across overflow).
  • We keep MAX_PRODUCERS-1 slots open in the queue so that if multiple producers all see one slot available and proceed, they can each still place their entries.
  • It uses 32-bit quantities for head and tail, so that it can snapshot them with a 64-bit atomic read without a lock.

My test involves a couple of threads writing a known set of values to the queue, and a consumer thread polling (when the buffer is not empty) and summing everything, then verifying that the correct total is obtained. With 2 or more producers I get inconsistent sums; with 1 producer it works.

Any help would be much appreciated. Thank you!

Here is the code:

#include <stdint.h>
#include <stddef.h>

#define RING_BUF_SIZE 1024   /* example value; must be a power of 2 */
#define MAX_PRODUCERS 4      /* example value */

struct ring_buf_entry {
  uint32_t seqn;
};

struct __attribute__((packed, aligned(8))) ring_buf {
  union {
    struct {
      volatile uint32_t tail;
      volatile uint32_t head;
    };
    volatile uint64_t snapshot;
  };
  volatile struct ring_buf_entry buf[RING_BUF_SIZE];
};

#define RING_SUB(x,y) ((x)>=(y)?((x)-(y)):((x)+(1ULL<<32)-(y)))

static void ring_buf_push(struct ring_buf* rb, uint32_t seqn)
{
  size_t pos;
   
  while (1) {
    // rely on aligned, packed, and no member-reordering properties
    uint64_t snapshot = __atomic_load_n(&(rb->snapshot), __ATOMIC_SEQ_CST);
    // little endian.
    uint64_t snap_head = snapshot >> 32;
    uint64_t snap_tail = snapshot & 0xffffffffULL;

    if (RING_SUB(snap_tail, snap_head) < RING_BUF_SIZE - MAX_PRODUCERS + 1) {
      uint32_t exp = snap_tail;
      if (__atomic_compare_exchange_n(&(rb->tail), &exp, snap_tail+1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
        pos = snap_tail;
        break;
      }
    }

    asm volatile("pause\n": : :"memory");
  }

  pos &= RING_BUF_SIZE-1;

  rb->buf[pos].seqn = seqn;
    
  asm volatile("sfence\n": : :"memory");
}

static struct ring_buf_entry ring_buf_poll(struct ring_buf* rb)
{
  struct ring_buf_entry ret = rb->buf[__atomic_load_n(&(rb->head), __ATOMIC_SEQ_CST) & (RING_BUF_SIZE-1)];
  __atomic_add_fetch(&(rb->head), 1, __ATOMIC_SEQ_CST);
  return ret;
}
  • FYI, instead of `__attribute__((packed))`, you could just apply C11 `_Alignas(8)` to the 8-byte union member of the struct, or to `snapshot`. (Or `#include <stdalign.h>` to define `alignas` as `_Alignas`.) You don't need `packed` for anything, just alignment of that one member, even on a weird compiler that would invent padding where none is needed. – Peter Cordes Jul 03 '22 at 02:07
  • x86 `sfence` is useless unless you're using NT stores. Normal stores are already ordered at run-time, you only need to prevent compile-time reordering if you want store ordering, thus `asm("":::"memory")`. See [When should I use \_mm\_sfence \_mm\_lfence and \_mm\_mfence](https://stackoverflow.com/a/50780314). But see also [Who's afraid of a big bad optimizing compiler?](https://lwn.net/Articles/793253/) on LWN re: using barriers without `volatile` for stores that should be visible across threads. Oh, but `buf[]` is already `volatile`. Also not sure why you want a store-store barrier there – Peter Cordes Jul 03 '22 at 02:11
  • Since it's a single-consumer queue, the reader doesn't seem to need an atomic RMW to increment `head`. It could just load head into a temporary, and separately atomic-store `old_head+1`. I haven't spotted what might be causing a bug with multiple producers, but that shouldn't make it worse. – Peter Cordes Jul 03 '22 at 02:20
  • How does the `ring_buf_poll` reader know the queue isn't empty? What if it runs faster than the producers, and loops around to re-read sequence numbers it's already seen? With multiple producers contending with each other, perhaps that's happening. – Peter Cordes Jul 03 '22 at 02:21
  • BTW, `RING_SUB` looks over-complicated. 32-bit unsigned subtraction Just Works even across wrap-around. e.g. `0x01 - 0xff` gives 2 in 8-bit unsigned. `0x1 - 0x2` gives `0xff` in 8-bit unsigned, the max value. So you can zero-extend to 64 *after* subtraction if you want/need. – Peter Cordes Jul 03 '22 at 02:25
  • Oh ok, thank you for this feedback; I will incorporate it into my implementation. I had a testing bug along the lines of what you described. – Mihir Shah Jul 03 '22 at 18:22

0 Answers