
This question is inspired by Lock-free Progress Guarantees. The code shown there is not strictly lock-free: whenever a writer thread is suspended while the queue is neither empty nor full, the reader threads return false, preventing the whole data structure from making progress.
What should be the correct behavior for a truly lock-free ring buffer?

Generally, truly lock-free algorithms involve a phase in which a thread that encounters an incomplete operation actually tries to ASSIST the pre-empted thread in completing it.

Any reference to this technique? How could it be implemented for an array-based MPMC queue?

I looked at some implementations, and they have similar problems.

Peter Cordes
Nimrod
  • FYI, I think most of the time, the assist stuff is mostly relevant as an exercise in computer science, not engineering, because it's so complex and hard to get right, and I assume it comes with a performance cost for the common case. So if you care about average throughput, you don't actually want this; you just make your queue big enough. (That does *not* invalidate the question, more like a hint that looking in real-world implementations is probably the wrong place to find an implementation of these techniques. I'm curious, too, exactly how hairy it would get to do this for an MPMC queue.) – Peter Cordes Apr 03 '22 at 12:29
  • Thanks, @Peter, your comments (including those under the references) are inspiring to me. I care more about engineering, so making the queue bigger, i.e. making the average fullness smaller, is a throughput solution. – Nimrod Apr 03 '22 at 13:15
  • Note that the `liblfds` implementation is often a great compromise. It can lead to waiting when the queue is almost empty, but that's often exactly when you can afford waiting. Real lock-free solutions are possible, but I don't know of any that aren't significantly slower or more complicated. – Matt Timmermans Apr 03 '22 at 14:08
  • In particular, garbage collection is always an issue with these algorithms, so I'm not sure that it really makes sense specifically to have a ring buffer. – David Eisenstat Apr 03 '22 at 15:08
  • @DavidEisenstat: Holding values directly in the queue (like a struct of several integers in size) avoids a deallocation problem. Even if the items you pass through the queue point to other data, you know that exactly 1 thread has pulled that data from the queue and now owns it. No other thread will be dereferencing those pointers, so that reader can deallocate any pointed-to memory. (But not the queue node itself; that's part of an array). Even just passing a single pointer as the queue payload achieves this; you don't deref it until after claiming ownership via a queue read. – Peter Cordes Apr 03 '22 at 22:10
  • @PeterCordes right, that part is not specific to truly lock-free queues and is not the problem. The problem is when threads start groveling in each other's incomplete requests. – David Eisenstat Apr 03 '22 at 22:42
  • @DavidEisenstat: Oh right, forgot what the topic was for "these algorithms". But Matt's answer suggests that it's doable without extra storage, in a flat array ring buffer. As long as nobody's dereferencing any pointers copied out of the queue until they've confirmed a complete read that gives them ownership, you still don't have a problem. – Peter Cordes Apr 04 '22 at 00:16

1 Answer


As a good example of how cross-thread assist often ends up working in real life, consider that a lock-free MPMC queue can be obtained by changing the liblfds algorithm along these lines:

Use 3 counters:

  • alloc_pos: the total number of push operations that have been started. This is incremented atomically when a push starts.
  • write_pos: all write operations at lower positions are known to be complete.
  • read_pos: all items written at lower positions are known to have been consumed.

In this scheme, a push or pop operation is completed by a CAS in the affected slot, so the per-slot state is the real source of truth; the write_pos and read_pos variables are redundant, conservative hints.

So to push, a thread first increments alloc_pos, and then increments write_pos past all the slots in front of it that it can see are complete. This is an assist -- it is completing previous writes that were started in other threads. Then the thread has to scan the slots between write_pos and alloc_pos until it finds a free one and manages to reserve it with a CAS.

To pop, the reader first increments read_pos past all the items written at lower positions that it can see are consumed. Again this is an assist -- completing previous reads. Then it scans from read_pos to alloc_pos to see if it can find an item that is finished being written.
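Putting both assist phases together, a minimal self-contained sketch might look like the following. This is illustrative only, not liblfds code: as the answer notes, a real implementation needs per-slot sequence numbers to rule out ABA across wraparound, and careful memory-order choices; here everything uses the default `seq_cst` ordering, and the slot-state machine is an assumption.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t CAP = 8;
enum SlotState : uint8_t { EMPTY, RESERVED, FULL };
struct Slot { std::atomic<uint8_t> state{EMPTY}; int value{0}; };

struct Ring {
    std::atomic<uint64_t> alloc_pos{0};  // pushes started
    std::atomic<uint64_t> write_pos{0};  // writes below this are complete
    std::atomic<uint64_t> read_pos{0};   // reads below this are complete
    Slot slots[CAP];
};

bool push(Ring& q, int v) {
    uint64_t a = q.alloc_pos.load();
    for (;;) {  // claim a position, unless the ring is full
        if (a - q.read_pos.load() >= CAP) return false;
        if (q.alloc_pos.compare_exchange_weak(a, a + 1)) break;
    }
    // Assist: advance write_pos past writes other threads have completed.
    uint64_t w = q.write_pos.load();
    while (w < q.alloc_pos.load() && q.slots[w % CAP].state.load() == FULL) {
        q.write_pos.compare_exchange_weak(w, w + 1);
        w = q.write_pos.load();
    }
    for (;;) {  // scan [write_pos, alloc_pos) for a free slot to reserve
        uint64_t hi = q.alloc_pos.load();
        for (uint64_t i = q.write_pos.load(); i < hi; ++i) {
            Slot& s = q.slots[i % CAP];
            uint8_t e = EMPTY;
            if (s.state.compare_exchange_strong(e, RESERVED)) {
                s.value = v;
                s.state.store(FULL);  // publish the completed write
                return true;
            }
        }
    }
}

bool pop(Ring& q, int& out) {
    // Assist: advance read_pos past reads other threads have completed.
    uint64_t r = q.read_pos.load();
    while (r < q.write_pos.load() && q.slots[r % CAP].state.load() == EMPTY) {
        q.read_pos.compare_exchange_weak(r, r + 1);
        r = q.read_pos.load();
    }
    // Scan [read_pos, alloc_pos) for a finished write to claim.
    uint64_t hi = q.alloc_pos.load();
    for (uint64_t i = q.read_pos.load(); i < hi; ++i) {
        Slot& s = q.slots[i % CAP];
        int v = s.value;  // read before claiming; valid only if the CAS wins
        uint8_t e = FULL;
        if (s.state.compare_exchange_strong(e, EMPTY)) {
            out = v;
            return true;
        }
    }
    return false;  // nothing readable right now
}
```

Note how a writer that finds its scan range occupied is simply completing bookkeeping for stalled peers rather than spinning on a single head counter; that is the assist.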

As mentioned in comments, doing this for real gets annoying, with implementation decisions trading off performance against which ordering and availability guarantees you need, as well as jumping through hoops to prevent ABA problems.

Matt Timmermans
  • I think I basically get the idea. Whenever a read/write is to be done, it first **assists** in completing the previous slots. Instead of reading/writing the next position, it finds a suitable position in a range. So when a thread is suspended, the whole data structure still works. But it seems to need much more extra work for correctness, like additional checks to avoid multiple writes/reads on the same slot, the `_pos` counters wrapping back to `0`, and so on... Also, the assist process has a non-negligible cost. Anyway, from a theoretical perspective, I think it's a solution. – Nimrod Apr 04 '22 at 03:39