You probably want to put pointers in your queue, rather than actually copying data into / out of the shared ring itself, i.e. the ring-buffer payload is just a pointer.
Release/acquire semantics takes care of making sure that the data is there when you dereference a pointer you get from the queue. But then you have a deallocation problem: how does a producer know when a consumer is done using a buffer so it can reuse it?
If it's ok to hand over ownership of the buffer, then that's fine. Maybe the consumer can use the buffer for something else, like add it to a local free-list or maybe use it for something it produces.
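A minimal sketch of that handoff, with made-up names (Message, slot): a single slot stands in for one queue entry, and raw pointers stand in for what would really be offsets into the shared mapping if producer and consumer are separate processes:

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>

// Hypothetical illustration only: the payload lives in shared memory, and the
// queue entry is just a pointer to it.  Release/acquire on the pointer is what
// makes the payload bytes visible when the consumer dereferences it.
struct Message { std::size_t len; char data[400]; };

std::atomic<Message*> slot{nullptr};            // stand-in for one queue entry

void produce(Message* buf, const char* src, std::size_t n) {
    std::memcpy(buf->data, src, n);             // fill the buffer first
    buf->len = n;
    slot.store(buf, std::memory_order_release); // then publish the pointer
}

Message* consume() {
    Message* m = slot.load(std::memory_order_acquire);
    // If non-null, the acquire load guarantees m->data is fully visible.
    // Ownership of *m has now passed to this consumer (free-list, reuse, ...).
    return m;
}
```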
For the following, see the ring-buffer based lockless MPMC queue analyzed in Lock-free Progress Guarantees. I'm imagining modifications to it that would make it suit your purposes.
It has a read-index and a write-index, and each ring-buffer node has a sequence counter that lets it detect writers catching up with readers (queue full) vs. readers catching up with writers (queue empty), without creating contention between readers and writers. (IIRC, readers read the write-index or vice versa, but there's no shared data that's modified by both readers and writers.)
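Condensed, that control structure looks something like the sketch below: it's in the style of Dmitry Vyukov's bounded MPMC queue with per-node sequence counters, which matches the description above, though the exact code that question analyzes differs in details (padding and the symmetric pop side are omitted here).

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Sketch of a bounded MPMC ring with a per-node sequence counter.
// N must be a power of 2.
template <typename T, std::size_t N>
struct MpmcRing {
    struct Node {
        std::atomic<std::size_t> seq;   // distinguishes free / full / in-progress
        T payload;
    };
    Node ring[N];
    alignas(64) std::atomic<std::size_t> write_idx{0};  // advanced only by producers
    alignas(64) std::atomic<std::size_t> read_idx{0};   // advanced only by consumers

    void init() { for (std::size_t i = 0; i < N; ++i) ring[i].seq.store(i); }

    bool try_push(const T& v) {
        std::size_t idx = write_idx.load(std::memory_order_relaxed);
        for (;;) {
            Node& n = ring[idx & (N - 1)];
            std::size_t seq = n.seq.load(std::memory_order_acquire);
            std::intptr_t dif = (std::intptr_t)(seq - idx);
            if (dif == 0) {                     // slot is free for this ticket
                if (write_idx.compare_exchange_weak(idx, idx + 1,
                                                    std::memory_order_relaxed)) {
                    n.payload = v;              // the partially-complete-write window
                    n.seq.store(idx + 1, std::memory_order_release);
                    return true;
                }                               // CAS failure updated idx; retry
            } else if (dif < 0) {
                return false;                   // caught up with readers: queue full
            } else {
                idx = write_idx.load(std::memory_order_relaxed);
            }
        }
    }
    // try_pop mirrors this: wait for seq == idx + 1, CAS read_idx, copy the
    // payload out, then store seq = idx + N to mark the slot reusable.
};
```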
If there's a reasonable upper bound on buffer sizes, you could have shared fixed-size buffers associated with each node in the ring buffer. Like maybe 1kiB or 4kiB. Then you wouldn't need a payload in the ring buffer; the index would be the interesting thing.
If memory allocation footprint isn't a big deal (only cache footprint), even 64kiB or 1MiB buffers would be mostly fine, even if you normally only use the low 400 bytes of each. Parts of the buffer that don't get used will just stay cold in cache. If you're using 2MiB hugepages, buffers smaller than that are a good idea to reduce TLB pressure: you want multiple buffers to be covered by the same TLB entry.
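One way that layout could look, with made-up sizes (4kiB buffers, 256-deep queue) so every buffer falls under a single 2MiB hugepage TLB entry:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical shared-memory layout: compact control nodes in the ring, plus a
// parallel array of fixed-size buffers, one per slot.  The queue itself then
// only needs to carry the slot index (or nothing, if buffer i is permanently
// tied to ring slot i).
constexpr std::size_t kQueueDepth = 256;
constexpr std::size_t kBufSize    = 4096;   // assumed upper bound on one message

struct alignas(64) ControlNode {
    std::uint64_t seq;   // sequence counter as in the MPMC sketch above
    std::uint32_t len;   // how many of the kBufSize bytes are actually valid
};

struct SharedRegion {
    ControlNode nodes[kQueueDepth];                  // hot, small, cache-friendly
    alignas(4096) char bufs[kQueueDepth][kBufSize];  // 1MiB of payload: half of
};                                                   // one 2MiB hugepage
```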
But you'd need to claim a buffer before writing to it, and finish writing it before finishing the second step of adding an entry to the queue. You probably don't want to do more than just memcpy, because a partially-complete write blocks readers if it becomes the oldest entry in the queue before it finishes. Maybe you could write-prefetch the buffer (with prefetchw on Broadwell or newer) before trying to claim it, to shrink the window during which you're (potentially) blocking the queue. But if there's low contention between writers, that might not matter. And if contention is high enough that you don't (almost) always succeed at claiming the first buffer you try, write-prefetching the wrong buffer will slow down the reader or writer that does own it. Maybe a normal (read) prefetch would be good in that case.
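In GNU C++ the write-prefetch could be expressed something like this (prefetch_for_write is a made-up helper; you only actually get prefetchw if the target supports it, e.g. -mprfchw or -march=broadwell):

```cpp
#include <cstddef>

// Sketch: warm the buffer's cache lines for writing before attempting to claim
// its slot, so less of the memcpy stalls inside the claimed-but-unpublished
// window.  Pass rw=0 instead for a plain read prefetch.
static inline void prefetch_for_write(const void* buf, std::size_t len) {
    const char* p = static_cast<const char*>(buf);
    for (std::size_t off = 0; off < len; off += 64)   // one prefetch per cache line
        __builtin_prefetch(p + off, /*rw=*/1, /*locality=*/3);
}
```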
If buffers are tied directly to queue entries, maybe you should just put them in the queue, as long as the MPMC library allows you to use custom reader code that reads a length and copies out that many bytes, instead of always copying a whole giant array.
Then every queue control entry that producers / consumers look at will be in a separate cache line, so there's no contention between two producers claiming adjacent entries.
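A sketch of what such an entry and its custom writer/reader bodies might look like (kMaxPayload and the function names are made up; the per-node sequence counter from the MPMC design is omitted):

```cpp
#include <cstdint>
#include <cstring>

// Queue entry with an inline payload plus a length, assuming some upper bound
// kMaxPayload on message size.  The writer/reader bodies are what you'd run
// while the slot is claimed: they copy only len bytes, not the whole array.
constexpr std::uint32_t kMaxPayload = 1024;

struct alignas(64) InlineEntry {
    std::uint32_t len;
    char data[kMaxPayload];
};

inline void write_entry(InlineEntry& e, const void* src, std::uint32_t len) {
    std::memcpy(e.data, src, len);
    e.len = len;
}

inline std::uint32_t read_entry(const InlineEntry& e, void* dst) {
    std::memcpy(dst, e.data, e.len);
    return e.len;
}
```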
If you need really big buffers because your upper bound is like 1MiB or something, retries because of contention will lead to touching more TLB entries, so a more compact ring buffer with the large buffers separate might be a better idea.
A reader half-way through claiming a buffer doesn't block other readers. It only blocks the queue if it wraps around and a producer is stuck waiting for it. So you can definitely have your readers use the data in-place in the queue, if it's big enough and readers are quick. But the more you do during a partially-complete read, the higher chance that you sleep and eventually block the queue.
This is a much bigger deal for producers, especially if the queue is normally (nearly) empty: consumers are coming up on newly-written entries almost as soon as they're produced. This is why you might want to make sure to prefetch the data you're going to copy in, and/or the shared buffer itself, before starting a produce operation.
400 bytes is only 12.5 cycles of committing 32 bytes per clock to L1d cache (e.g. Intel Haswell / Skylake), so it's really short compared to inter-core latencies or the time you have to wait for an RFO on a cache write-miss. So the minimum time from a producer making its claim of a node globally visible to completing that claim (so readers can read it, and later entries) is still very short. Blocking the queue for a long time is hopefully avoidable.
That much data even fits in 13 YMM registers, so a compiler could in theory actually load the data into registers before claiming a buffer entry, and just do stores. You could maybe do this by hand with intrinsics, with a fully-unrolled loop. (You can't index the register file, so it has to be fully unrolled, or always store 416 bytes, or whatever.)
Or 7 ZMM registers with AVX512, but you probably don't want to use 512-bit loads/stores if you aren't using other 512-bit instructions, because of the effects on max-turbo clock speed and shutting down port 1 for vector ALU uops. (I assume that still happens with vector load/store, but if we're lucky some of those effects only happen with 512-bit ALU uops...)
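A sketch of the 13-YMM-register version with intrinsics (hypothetical function; compile with -mavx or higher). It rounds 400 bytes up to 416 and leaves the actual claim step as a comment, so the caller must guarantee the source has 416 readable bytes and the destination slot tolerates the over-copy:

```cpp
#include <immintrin.h>

// Load the whole message into 13 YMM registers *before* claiming the queue
// slot, then do nothing but stores inside the claimed-but-unpublished window.
void copy_416_preloaded(void* dst, const void* src) {
    const char* s = static_cast<const char*>(src);
    char*       d = static_cast<char*>(dst);
    __m256i r[13];
    for (int i = 0; i < 13; ++i)            // constant trip count: compilers
        r[i] = _mm256_loadu_si256(          // typically unroll and keep r[] in regs
                   reinterpret_cast<const __m256i*>(s + 32 * i));

    // ... claim the queue entry here (CAS on the write index) ...

    for (int i = 0; i < 13; ++i)
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(d + 32 * i), r[i]);
}
```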