2

I want to pass variable-length messages from multiple producers to multiple consumers over a low-latency queue on multi-socket Xeon E5 systems. (Passing 400 bytes with a latency of 300 ns would be nice, for example.)

I've looked for existing implementations of lockless multiple-producer/multiple-consumer (MPMC) queues using a non-blocking ring buffer, but most implementations/algorithms online are node-based (i.e. each node has a fixed length), such as boost::lockfree::queue, midishare, etc.

Of course, one can argue that the node type can be set to uint8_t or the like, but then writes will be clumsy and the performance will be horrible.

I'd also like the algorithm to offer overwrite detection on the reader side, so that readers can detect when data they are reading has been overwritten.

How can I implement a queue (or something else) that does this?

Peter Cordes
HCSF
    @close voters: I rephrased this question to not be a library request. It was previously written that way, but the underlying algorithm-design question is interesting. – Peter Cordes Aug 17 '18 at 06:36

2 Answers

2

Sorry for a bit of a late answer, but have a look at DPDK's Ring library. It is free (BSD license), blazingly fast (I doubt you will find a faster solution for free) and supports all major architectures. There are lots of examples as well.

to pass variable-length messages

The solution is to pass a pointer to a message, not the whole message. DPDK also offers a Mempool library to allocate/deallocate buffers between multiple threads or processes. The mempool is also fast, lock-free and supports many architectures.

So the overall solution would be (a rough sketch follows the list):

  1. Create mempool(s) to share buffers among threads/processes. Each mempool supports just one fixed buffer size, so you might want to create a few mempools to match your needs.

  2. Create one MPMC ring or a set of SPSC ring pairs between your threads/processes. The SPSC solution might be faster, but it might not fit your design.

  3. Producer allocates a buffer, fills it and passes a pointer to that buffer via the ring.

  4. Consumer receives the pointer, reads the message and deallocates the buffer.
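
A minimal sketch of these four steps, assuming DPDK's documented `rte_mempool` / `rte_ring` C API. The names (`msg_pool`, `msg_ring`), the sizes, and the omitted EAL init and error handling are placeholders, not a complete program:

```cpp
#include <rte_eal.h>
#include <rte_mempool.h>
#include <rte_ring.h>
#include <cstddef>
#include <cstring>

static struct rte_mempool *msg_pool;   // step 1: shared buffer pool
static struct rte_ring    *msg_ring;   // step 2: MPMC ring of pointers

// Call once after rte_eal_init() has succeeded.
static void setup(void) {
    // 4095 buffers of 1 KiB each, with a small per-lcore object cache.
    msg_pool = rte_mempool_create("msg_pool", 4095, 1024, 32, 0,
                                  NULL, NULL, NULL, NULL,
                                  SOCKET_ID_ANY, 0);
    // Ring size must be a power of two; default flags give an MPMC ring.
    msg_ring = rte_ring_create("msg_ring", 1024, SOCKET_ID_ANY, 0);
}

// Step 3: producer takes a buffer from the pool, fills it, passes the pointer.
static void produce(const void *msg, size_t len) {
    void *buf;
    if (rte_mempool_get(msg_pool, &buf) != 0)
        return;                              // pool exhausted
    memcpy(buf, msg, len);
    if (rte_ring_enqueue(msg_ring, buf) != 0)
        rte_mempool_put(msg_pool, buf);      // ring full: return the buffer
}

// Step 4: consumer receives the pointer, reads the message, frees the buffer.
static void consume(void) {
    void *buf;
    if (rte_ring_dequeue(msg_ring, &buf) == 0) {
        // ... read up to 1 KiB of message data from buf ...
        rte_mempool_put(msg_pool, buf);
    }
}
```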

It sounds like a lot of work, but there are lots of optimizations inside DPDK mempools and rings. But will it fit into 300 ns?

Have a look at the official DPDK performance reports. While there is no official report for ring performance, there are vhost/virtio test results. Basically, packets travel like this:

Traffic gen. -- Host -- Virtual Machine -- Host -- Traffic gen.

Host runs as one process, virtual machine as another.

The test result is ~4M packets per second for 512-byte packets. That does not fit your budget, but you need to do much, much less work than that test does...

Andriy Berestovskyy
1

You probably want to put pointers in your queue, rather than actually copying data into / out of the shared ring itself. i.e. the ring buffer payload is just a pointer.

Release/acquire semantics takes care of making sure that the data is there when you dereference a pointer you get from the queue. But then you have a deallocation problem: how does a producer know when a consumer is done using a buffer so it can reuse it?

If it's ok to hand over ownership of the buffer, then that's fine. Maybe the consumer can use the buffer for something else, like add it to a local free-list or maybe use it for something it produces.
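
If ownership handover works for you, the consumer side can be as simple as keeping a per-thread free list. A minimal sketch; `Buf`, the 1 KiB size, and the function names are made up for illustration:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fixed-size message buffer.
struct Buf { std::size_t len; char data[1024]; };

// Per-thread free list: once a consumer owns a buffer, it can recycle it
// for its own output instead of handing it back to the original producer.
thread_local std::vector<Buf*> free_list;

void consume(Buf *b) {
    // ... use b->data[0 .. b->len) ...
    free_list.push_back(b);          // keep ownership for later reuse
}

Buf *alloc_for_produce() {
    if (!free_list.empty()) {
        Buf *b = free_list.back();
        free_list.pop_back();
        return b;
    }
    return new Buf;                  // fall back to the global allocator
}
```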


For the following, see the ring-buffer based lockless MPMC queue analyzed in Lock-free Progress Guarantees. I'm imagining modifications to it that would make it suit your purposes.

It has a read-index and a write-index, and each ring-buffer node has a sequence counter that lets it detect writers catching up with readers (queue full) vs. readers catching up with writers (queue empty), without creating contention between readers and writers. (IIRC, readers read the write-index or vice versa, but there's no shared data that's modified by both readers and writers.)
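
A minimal sketch of that per-node sequence-counter scheme, in the spirit of the queue analyzed in that Q&A (and of Dmitry Vyukov's bounded MPMC queue); simplified, enqueue side only, no backoff:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

struct Node {
    std::atomic<uint64_t> seq;   // compared against positions to detect full/empty
    void *payload;               // or an inline fixed-size buffer
};

constexpr std::size_t N = 1024;  // power of two; at init, ring[i].seq = i
Node ring[N];
std::atomic<uint64_t> enqueue_pos{0};
std::atomic<uint64_t> dequeue_pos{0};

bool try_enqueue(void *p) {
    uint64_t pos = enqueue_pos.load(std::memory_order_relaxed);
    for (;;) {
        Node &n = ring[pos & (N - 1)];
        uint64_t seq = n.seq.load(std::memory_order_acquire);
        int64_t diff = (int64_t)(seq - pos);
        if (diff == 0) {
            // Slot is free: claim it by advancing the write index.
            if (enqueue_pos.compare_exchange_weak(pos, pos + 1,
                                                  std::memory_order_relaxed)) {
                n.payload = p;
                n.seq.store(pos + 1, std::memory_order_release);  // publish
                return true;
            }
            // CAS failed: pos was reloaded by compare_exchange_weak; retry.
        } else if (diff < 0) {
            return false;            // writers have caught up with readers: full
        } else {
            pos = enqueue_pos.load(std::memory_order_relaxed);    // lost a race
        }
    }
}
// Dequeue is symmetric: a reader waits for seq == pos + 1, then stores pos + N.
```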


If there's a reasonable upper bound on buffer sizes, you could have shared fixed-size buffers associated with each node in the ring buffer. Like maybe 1kiB or 4kiB. Then you wouldn't need a payload in the ring buffer; the index would be the interesting thing.

If memory-allocation footprint isn't a big deal (only cache footprint is), even 64k or 1M buffers would be mostly fine, even if you normally only use the low 400 bytes of each. Parts of the buffer that don't get used will just stay cold in cache. If you're using 2MiB hugepages, buffers smaller than that are a good idea to reduce TLB pressure: you want multiple buffers to be covered by the same TLB entry.

But you'd need to claim a buffer before writing to it, and finish writing to it before finishing the second step of adding an entry to the queue. You probably don't want to do more than just memcpy, because a partially-complete write blocks readers if it becomes the oldest entry in the queue before it finishes. Maybe you could write-prefetch the buffer (with prefetchw on Broadwell or newer) before trying to claim it, to shorten the window during which you're (potentially) blocking the queue. But if there's low contention among writers, that might not matter. And if there's high contention, so you don't (almost) always succeed at claiming the first buffer you try, a write-prefetch on the wrong buffer will slow down the reader or writer that does own it. Maybe a normal prefetch would be good.
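
For the write-prefetch idea, something like this (GCC/Clang builtin; with `-march=broadwell` or later the `rw=1` hint can compile to `prefetchw`; the surrounding names are placeholders):

```cpp
// Hint that we're about to write this cache line; harmless if we guess wrong,
// apart from the contention effects discussed above.
static inline void prefetch_for_write(const void *p) {
    __builtin_prefetch(p, /*rw=*/1, /*locality=*/3);
}

// Producer, before attempting to claim a node:
//   prefetch_for_write(&buffers[next_write_pos & (N - 1)]);
//   ...then CAS to claim the node, memcpy the payload, and publish it.
```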

If buffers are tied directly to queue entries, maybe you should just put them in the queue, as long as the MPMC library allows you to use custom reader code that reads a length and copies out that many bytes, instead of always copying a whole giant array.
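
A sketch of what such a fixed-size node with custom read/write functions could look like (the same `node_buf` idea that comes up in the comments below; the 4 KiB bound and the names are illustrative):

```cpp
#include <cstddef>
#include <cstring>

// One ring-buffer slot: a length plus an inline fixed-size buffer.
// Readers and writers only touch 'size' bytes, not the whole array.
struct alignas(64) node_buf {
    std::size_t size;                             // valid bytes in data[]
    char        data[4096 - sizeof(std::size_t)]; // upper bound on message size
};

// Custom writer: copy only the used part of the message into the slot.
void write_node(node_buf &dst, const void *msg, std::size_t len) {
    dst.size = len;
    std::memcpy(dst.data, msg, len);
}

// Custom reader: read the length first (same cache line as the start of the
// data), then copy out only that many bytes.
std::size_t read_node(const node_buf &src, void *out, std::size_t cap) {
    std::size_t n = src.size < cap ? src.size : cap;
    std::memcpy(out, src.data, n);
    return n;
}
```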

Then every queue control entry that producers / consumers look at will be in a separate cache line, so there's no contention between two producers claiming adjacent entries.

If you need really big buffers because your upper bound is like 1MiB or something, retries because of contention will lead to touching more TLB entries, so a more compact ring buffer with the large buffers separate might be a better idea.


A reader half-way through claiming a buffer doesn't block other readers. It only blocks the queue if it wraps around and a producer is stuck waiting for it. So you can definitely have your readers use the data in-place in the queue, if it's big enough and readers are quick. But the more you do during a partially-complete read, the higher chance that you sleep and eventually block the queue.

This is a much bigger deal for producers, especially if the queue is normally (nearly) empty: consumers are coming up on newly-written entries almost as soon as they're produced. This is why you might want to make sure to prefetch the data you're going to copy in, and/or the shared buffer itself, before claiming an entry as a producer.


400 bytes is only 12.5 cycles of committing 32 bytes per clock to L1d cache (e.g. Intel Haswell / Skylake), so it's really short compared to inter-core latencies or the time you have to wait for an RFO on a cache write-miss. So the minimum time between a producer making the claim of a node globally visible to when you complete that claim so readers can read it (and later entries) is still very short. Blocking the queue for a long time is hopefully avoidable.

That much data even fits in 13 YMM registers, so a compiler could in theory actually load the data into registers before claiming a buffer entry, and just do stores. You could maybe do this by hand with intrinsics, with a fully-unrolled loop, as in the sketch below. (You can't index the register file, so it has to be fully unrolled, or always store 416 bytes, or whatever.)
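
A hedged illustration of that idea with AVX intrinsics, shown for only the first 4 of the 13 registers (a real version would be fully unrolled to cover all 416 bytes, with the node-claiming step between the loads and the stores):

```cpp
#include <immintrin.h>

void copy_into_claimed_node(char *dst, const char *src) {
    // Loads can happen before the node is claimed...
    __m256i r0 = _mm256_loadu_si256((const __m256i *)(src +  0));
    __m256i r1 = _mm256_loadu_si256((const __m256i *)(src + 32));
    __m256i r2 = _mm256_loadu_si256((const __m256i *)(src + 64));
    __m256i r3 = _mm256_loadu_si256((const __m256i *)(src + 96));
    // ... claim the node here (e.g. CAS on the write index) ...
    // ... then the stores put the data into the claimed slot.
    _mm256_storeu_si256((__m256i *)(dst +  0), r0);
    _mm256_storeu_si256((__m256i *)(dst + 32), r1);
    _mm256_storeu_si256((__m256i *)(dst + 64), r2);
    _mm256_storeu_si256((__m256i *)(dst + 96), r3);
}
```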

Or 7 ZMM registers with AVX512, but you probably don't want to use 512-bit loads/stores if you aren't using other 512-bit instructions, because of the effects on max-turbo clock speed and shutting down port 1 for vector ALU uops. (I assume that still happens with vector load/store, but if we're lucky some of those effects only happen with 512-bit ALU uops...)

Peter Cordes
  • Thanks for the reply. A lot of information to digest. So I am responding to your last section first -- yes, passing 400 bytes in less than 300 nano is very short/quick. Hence, I think blocking the queue shouldn't even happen (assuming the lock prefix to a CAS instruction in x86 doesn't count as blocking. tho, it really blocks). My client has existing code achieving such latency but due to IP, I can't post it here. But there is a "bug" in the code in that the readers can't really detect the buffer is overwritten. That's why I raised this question. Let me read the rest of your answer. Thx! – HCSF Aug 17 '18 at 06:12
  • Just a clarification for the latency claim for my client's code -- when there are more readers, the overall write-to-read latency goes up (e.g. going up 100% more when there are like 6 readers but 1 writer over a SHM in linux; which is still a mystery to me given readers have used no instruction with fence or lock prefix based on `objdump`). Just trying to be fair. – HCSF Aug 17 '18 at 06:18
  • @HCSF: 300ns is 900 clock cycles on a 3GHz CPU. That's fairly quick for off-core, but within a single core it's ages. Dr. Bandwidth measured 240 ns inter-socket latency on an old Sandybridge-Xeon system, for simply ping-ponging a single cache line. (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/606433). The latency might be similar for your small fixed-size buffers, because multiple cache lines can be in flight at once. **Using buffers tied to nodes instead of putting *pointers* in the nodes means those loads don't have to wait for an address.** – Peter Cordes Aug 17 '18 at 06:20
  • Within a single chip, inter-core latency can be as low as like 50ns for a quad-core desktop, or 16 ns between hyperthreads of a single physical core. Higher for many-core Xeons, because [the ring bus has more hops](https://stackoverflow.com/a/47787231/224132). SiSoft shows some inter-core latency benchmarks (https://www.sisoftware.co.uk/2017/06/24/intel-core-i9-skl-x-review-and-benchmarks-cache-memory-performance/) including a 10-core Skylake i9-7900X at 80ns. (Not part of a multi-socket system, though, which might matter because it prob. has to snoop the other socket.) – Peter Cordes Aug 17 '18 at 06:26
  • Do you have a sample code to illustrate "Using buffers tied to nodes instead of putting pointers in the nodes means those loads don't have to wait for an address"? Maybe I misunderstand -- I assume each node will have something like uint8_t data[NODE_SIZE], and then when the reader loads, it still needs to do something like data[0], which essentially deferences and gets the address? I must misunderstand. – HCSF Aug 17 '18 at 06:35
  • @HCSF: I was thinking something like `struct node_buf { size_t size; alignas(32) char data[4096-32]; };`. Then a `queue<node_buf>` sort of works, but would copy all 4k bytes even if you're only using the first few. That's why you need custom read/write functions that know they only need to copy up to `size` bytes of `data`. Yes, you have to "dereference" the size before reading the buffer, but it's in the same cache line as the data. (and branch prediction hides that latency anyway.) With just a pointer, it prob. wouldn't be in cache but that request couldn't start until the pointer arrived. – Peter Cordes Aug 17 '18 at 06:40
  • Oh, I see. Then actually the same thing can be done without a fixed node size. It can be something like [uint32_t][variable-length data with alignment of 4]....not sure what advantage a fixed length brings yet...besides easier implementation...let me think about it more. – HCSF Aug 17 '18 at 06:49
  • your link to [ring bus has more hops](https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug/47787231#47787231) is great. It explains why I see higher latency in SHM communication between cores in within the same socket in Skylake! Sigh – HCSF Aug 17 '18 at 06:50
  • hugepage is a good one. AFAIK, linux only offers 2 sizes of hugepages -- 2MB or 1GB. If I have multiple 1MB, 4MB and 8MB shared memory segments as the ring buffers in this question, won't it be better to put all of them on one 1GB hugepage (assuming it is possible) such that I only need one TLB entry in each core? And it will likely stay hot as the producers and consumers will keep fetching one of the shared memory segments? – HCSF Aug 17 '18 at 07:01
  • @HCSF You need it to be fixed-size so your ring buffer can be an array! Array indexing only works if elements are fixed-size. And if they're allocated only once at the start, they can't grow later, so they all need to be big enough. – Peter Cordes Aug 17 '18 at 07:28
  • @HCSF: Linux uses 2M hugepages transparently for anonymous memory. I think you can use 1G hugepages explicitly, but that requires the OS to find 1GiB of contiguous physical memory. Doing that allocation could take a lot of defragging. It's possible I think, but inconvenient, to use 1G hugepages in user-space on Linux. Skylake only has 4 L1dTLB entries for 1G pages, vs. 32 for 2M pages (https://www.7-cpu.com/cpu/Skylake.html), and older was worse, so it can hurt performance. https://www.pvk.ca/Blog/2014/02/18/how-bad-can-1gb-pages-be/ looks interesting and maybe correct, but I only skimmed – Peter Cordes Aug 17 '18 at 07:36
  • Right, I agree. I guess we are talking about 2 different things. I probably misunderstand your fixed length node. In my mind (at least boost::lockfree::queue), I was thinking about a ringbuffer (fixed length) with multiple nodes, each node fixed length as well. – HCSF Aug 17 '18 at 07:39
  • yeah, some people suggest setting a kernel boot parameter to allocate 1GB hugepages at boot time to increase the success rate. Ideally, I should use smaller shared memory segments. Just that I haven't figured out how to deal with readers being too slow yet. Hence, planning to use larger ones (as a hack) until I find a good MPMC ringbuffer with overwrite detection. Smaller shared memory size in total is also desired as everything can fit into L3 cache (hopefully). – HCSF Aug 17 '18 at 08:20
  • The lockless MPMC Q&A I linked is about the implementation in `liblfds`, which detects and blocks on queue full instead of silently overwriting unread entries. See the section in my answer for very brief summary of how it works, and that Q&A (or the code itself) for more details. – Peter Cordes Aug 17 '18 at 08:28