Lock-free Progress Guarantees in a circular buffer queue

Question

Anecdotally, I've found that a lot of programmers mistakenly believe that "lock-free" simply means "concurrent programming without mutexes". Usually, there's also a correlated misunderstanding that the purpose of writing lock-free code is for better concurrent performance. Of course, the correct definition of lock-free is actually about progress guarantees. A lock-free algorithm guarantees that at least one thread is able to make forward progress regardless of what any other threads are doing.

This means a lock-free algorithm can never have code where one thread is depending on another thread in order to proceed. E.g., lock-free code can not have a situation where Thread A sets a flag, and then Thread B keeps looping while waiting for Thread A to unset the flag. Code like that is basically implementing a lock (or what I would call a mutex in disguise).

However, other cases are more subtle and there are some cases where I honestly can't really tell if an algorithm qualifies as lock-free or not, because the notion of "making progress" sometimes appears subjective to me.

One such case is in the (well-regarded, afaik) concurrency library, liblfds. I was studying the implementation of a multi-producer/multi-consumer bounded queue in liblfds - the implementation is very straightforward, but I cannot really tell if it should qualify as lock-free.

The relevant algorithm is in lfds711_queue_bmm_enqueue.c. Liblfds uses custom atomics and memory barriers, but the algorithm is simple enough for me to describe in a paragraph or so.

The queue itself is a bounded contiguous array (ringbuffer). There is a shared read_index and write_index. Each slot in the queue contains a field for user-data, and a sequence_number value, which is basically like an epoch counter. (This avoids ABA issues).

The PUSH algorithm is as follows:

Atomically LOAD the write_index
Attempt to reserve a slot in the queue at write_index % queue_size using a CompareAndSwap loop that attempts to set write_index to write_index + 1.
If the CompareAndSwap is successful, copy the user data into the reserved slot.
Finally, update the sequence_index on the slot by making it equal to write_index + 1.

The actual source code uses custom atomics and memory barriers, so for further clarity about this algorithm I've briefly translated it into (untested) standard C++ atomics for better readability, as follows:

bool mcmp_queue::enqueue(void* data)
{
    int write_index = m_write_index.load(std::memory_order_relaxed);

    for (;;)
    {
        slot& s = m_slots[write_index % m_num_slots];
        int sequence_number = s.sequence_number.load(std::memory_order_acquire);
        int difference = sequence_number - write_index;

        if (difference == 0)
        {
            if (m_write_index.compare_exchange_weak(
                write_index,
                write_index + 1,
                std::memory_order_acq_rel
            ))
            {
                break;
            }
        }

        if (difference < 0) return false; // queue is full
    }

    // Copy user-data and update sequence number
    //
    s.user_data = data;
    s.sequence_number.store(write_index + 1, std::memory_order_release);
    return true;
}

Now, a thread that wants to POP an element from the slot at read_index will not be able to do so until it observes that the slot's sequence_number is equal to read_index + 1.

Okay, so there are no mutexes here, and the algorithm likely performs well (it's only a single CAS for PUSH and POP), but is this lock-free? The reason it's unclear to me is because the definition of "making progress" seems murky when there is the possibility that a PUSH or POP can always just fail if the queue is observed to be full or empty.

But what's questionable to me is that the PUSH algorithm essentially reserves a slot, meaning that the slot can never be POP'd until the push thread gets around to updating the sequence number. This means that a POP thread that wants to pop a value depends on the PUSH thread having completed the operation. Otherwise, the POP thread will always return false because it thinks the queue is EMPTY. It seems debatable to me whether this actually falls within the definition of "making progress".

Generally, truly lock-free algorithms involve a phase where a pre-empted thread actually tries to ASSIST the other thread in completing an operation. So, in order to be truly lock-free, I would think that a POP thread that observes an in-progress PUSH would actually need to try and complete the PUSH, and then only after that, perform the original POP operation. If the POP thread simply returns that the queue is EMPTY when a PUSH is in progress, the POP thread is basically blocked until the PUSH thread completes the operation. If the PUSH thread dies, or goes to sleep for 1,000 years, or otherwise gets scheduled into oblivion, the POP thread can do nothing except continuously report that the queue is EMPTY.

So does this fit the defintion of lock-free? From one perspective, you can argue that the POP thread can always make progress, because it can always report that the queue is EMPTY (which is at least some form of progress I guess.) But to me, this isn't really making progress, since the only reason the queue is observed as empty is because we are blocked by a concurrent PUSH operation.

So, my question is: is this algorithm truly lock-free? Or is the index reservation system basically a mutex in disguise?

Mutexes interact with the OS's scheduler and this algorithm does not, how could they possibly be equivalent? (maybe you meant a spin-lock?) — , Aug 27 '17 at 17:13
Working on an answer. I think it's technically not lock-free, because a thread that sleeps for a long time can keep a slot tied up, eventually having the queue wrap around and have writers and readers all stuck waiting for one thread to finish with its slot. But I don't think you could get this behaviour this efficiently with locking. Maybe with per-element mutexes, and atomic fetch_add for `read_index` and `write_index`. In practice, an algorithm doesn't have to be 100% lock-free to be useful or performant. It just has to not actually block with a large enough queue on real HW + OS. — Peter Cordes, Aug 27 '17 at 19:43
@Frank - IMO a spin-lock is simply an particular implementation of a mutex. Do you have any reference that indicates that a mutex has to be allow graceful blocking or interact with the scheduler? As I see it, any implementation which allows mutual exclusion is fine and graceful blocking is only a QOI issue (the argument might be different on non-preemptive operating systems). — BeeOnRope, Aug 27 '17 at 22:48
@PeterCordes - you can get this behavior efficiently (broadly speaking) without locking. The usual approach is to flip everything around and stop making the update of `m_write_index` the point at which the element is inserted, but instead treat it as a "hint" of where to insert. The actual important operation then is a CAS of `s.data` from NULL (or some other reserved value) to the user data. At that single point the value is considered "in" the structure and then `m_write_index` can be updated by the pushing thread, but even if it doesn't get updated, the structure keeps working. — BeeOnRope, Aug 27 '17 at 23:20
I.e., the next writer will end up seeing that `m_write_index` happens to point to a full slot and just look for the next empty slot and proceed as usual, leading to the cooperative behavior that's often necessary for something to be lock free as mentioned by the OP. Note that I'm not making any claim that this structure will perform as well than the one above in practical cases (but neither is it too terrible), but it should still be relatively efficient. — BeeOnRope, Aug 27 '17 at 23:22
@Frank: The actual library uses `LFDS711_BACKOFF_EXPONENTIAL_BACKOFF( qbmms->dequeue_backoff, backoff_iteration );` IDK what that does, but it may invoke OS support after some tries. That's not really the point, though. The point is that if you find that you could get equivalent behaviour with an actual mutex, should you use a mutex? Or are you still gaining anything by using atomics. (In this case I think you are.) — Peter Cordes, Aug 28 '17 at 00:16
As a side note, I believe the first `load` should use at least `acquire` ordering — LWimsey, Aug 29 '17 at 17:34
@LWimsey: I don't think anything bad happens with `m_write_index.load(relaxed);`. The `acq_rel` CAS is where order matters. It's typical to use a relaxed load to set up for a CAS. Readers don't touch `m_write_index`, so it could only sync with other writers. In practice, the compiler will probably generate code where the first `s.sequence_number.load(mo_acquire)` is dependency-ordered (like what `mo_consume` guarantees) because the load address has a data dependency on `write_index`. But I think it's safe even on Alpha AXP (which doesn't enforce ordering for dependent loads). — Peter Cordes, Sep 29 '17 at 15:00
@LWimsey: And I don't think it's possible for any load reordering to lead to spurious queue-full, so at worst you get a CAS retry. Remember that the global visibility of stores to sequence number and `m_write_index` are still ordered, so there are limits to what load-reordering can observe. — Peter Cordes, Sep 29 '17 at 15:03

score 19 · Accepted Answer · edited Sep 28 '17 at 09:33

This queue data structure is not strictly lock-free by what I consider the most reasonable definition. That definition is something like:

A structure is lock-free if only if any thread can be indefinitely suspended at any point while still leaving the structure usable by the remaining threads.

Of course this implies a suitable definition of usable, but for most structures this is fairly simple: the structure should continue to obey its contracts and allow elements to be inserted and removed as expected.

In this case a thread that has succeeded in incrementing m_write_increment, but hasn't yet written s.sequence_number leaves the container in what will soon be an unusable state. If such a thread is killed, the container will eventually report both "full" and "empty" to push and pop respectively, violating the contract of a fixed size queue.

There is a hidden mutex here (the combination of m_write_index and the associated s.sequence_number) - but it basically works like a per-element mutex. So the failure only becomes apparent to writers once you've looped around and a new writer tries to get the mutex, but in fact all subsequent writers have effectively failed to insert their element into the queue since no reader will ever see it.

Now this doesn't mean this is a bad implementation of a concurrent queue. For some uses it may behave mostly as if it was lock free. For example, this structure may have most of the useful performance properties of a truly lock-free structure, but at the same time it lacks some of the useful correctness properties. Basically the term lock-free usually implies a whole bunch of properties, only a subset of which will usually be important for any particular use. Let's look at them one by one and see how this structure does. We'll broadly categorize them into performance and functional categories.

Performance

Uncontended Performance

The uncontended or "best case" performance is important for many structures. While you need a concurrent structure for correctness, you'll usually still try to design your application so that contention is kept to a minimum, so the uncontended cost is often important. Some lock-free structures help here, by reducing the number of expensive atomic operations in the uncontended fast-path, or avoiding a syscall.

This queue implementation does a reasonable job here: there is only a single "definitely expensive" operation: the compare_exchange_weak, and a couple of possibly expensive operations (the memory_order_acquire load and memory_order_release store)¹, and little other overhead.

This compares to something like std::mutex which would imply something like one atomic operation for lock and another for unlock, and in practice on Linux the pthread calls have non-negligible overhead as well.

So I expect this queue to perform reasonably well in the uncontended fast-path.

Contended Performance

One advantage of lock-free structures is that they often allow better scaling when a structure is heavily contended. This isn't necessarily an inherent advantage: some lock-based structures with multiple locks or read-write locks may exhibit scaling that matches or exceeds some lock-free approaches, but it is usually that case that lock-free structures exhibit better scaling that a simple one-lock-to-rule-them-all alternative.

This queue performs reasonably in this respect. The m_write_index variable is atomically updated by all readers and will be a point of contention, but the behavior should be reasonable as long as the underlying hardware CAS implementation is reasonable.

Note that a queue is generally a fairly poor concurrent structure since inserts and removals all happen at the same places (the head and the tail), so contention is inherent in the definition of the structure. Compare this to a concurrent map, where different elements have no particular ordered relationship: such a structure can offer efficient contention-free simultaneous mutation if different elements are being accessed.

Context-switch Immunity

One performance advantage of lock-free structures that is related to the core definition above (and also to the functional guarantees) is that a context switch of a thread which is mutating the structure doesn't delay all the other mutators. In a heavily loaded system (especially when runnable threads >> available cores), a thread may be switched out for hundreds of milliseconds or seconds. During this time, any concurrent mutators will block and incur additional scheduling costs (or they will spin which may also produce poor behavior). Even though such "unluckly scheduling" may be rare, when it does occur the entire system may incur a serious latency spike.

Lock-free structures avoid this since there is no "critical region" where a thread can be context switched out and subsequently block forward progress by other threads.

This structure offers partial protection in this area — the specifics of which depend on the queue size and application behavior. Even if a thread is switched out in the critical region between the m_write_index update and the sequence number write, other threads can continue to push elements to the queue as long as they don't wrap all the way around to the in-progress element from the stalled thread. Threads can also pop elements, but only up to the in-progress element.

While the push behavior may not be a problem for high-capacity queues, the pop behavior can be a problem: if the queue has a high throughput compared to the average time a thread is context switched out, and the average fullness, the queue will quickly appear empty to all consumer threads, even if there are many elements added beyond the in-progress element. This isn't affected by the queue capacity, but simply the application behavior. It means that the consumer side may completely stall when this occurs. In this respect, the queue doesn't look very lock-free at all!

Functional Aspects

Async Thread Termination

On advantage of lock-free structures it they are safe for use by threads that may be asynchronously canceled or may otherwise terminate exceptionally in the critical region. Cancelling a thread at any point leaves the structure is a consistent state.

This is not the case for this queue, as described above.

Queue Access from Interrupt or Signal

A related advantage is that lock-free structures can usually be examined or mutated from an interrupt or signal. This is useful in many cases where an interrupt or signal shares a structure with regular process threads.

This queue mostly supports this use case. Even if the signal or interrupt occurs when another thread is in the critical region, the asynchronous code can still push an element onto the queue (which will only be seen later by consuming threads) and can still pop an element off of the queue.

The behavior isn't as complete as a true lock-free structure: imagine a signal handler with a way to tell the remaining application threads (other than the interrupted one) to quiesce and which then drains all the remaining elements of the queue. With a true lock-free structure, this would allow the signal handler to full drain all the elements, but this queue might fail to do that in the case a thread was interrupted or switched out in the critical region.

¹ In particular, on x86, this will only use an atomic operation for the CAS as the memory model is strong enough to avoid the need for atomics or fencing for the other operations. Recent ARM can do acquire and release fairly efficiently as well.

I was trying to find a way for the reader and writer to use `fetch_add` (x86 `lock xadd`) instead of CAS to avoid the need for retry when incrementing `m_write_index`. But the current algo only attempts the CAS with a `write_index+1` obtained *before* reading and checking the `sequence_number` of that index. `fetch_add` could result in multiple increments from multiple threads. And if you `fetch_add` *before* checking the sequence number, and then have to `fetch_sub` it to correct an overshoot on queue full, it gets super messy. In theory fetch_add is perfect to get each thread a unique idx :/ — Peter Cordes, Aug 28 '17 at 03:35
The readers blocking on a stalled writer isn't affected by queue capacity, but it is affected by queue average fullness. If the queue is usually nearly empty, this will be a problem right away. But if it's usually nearly full, you have a longer window before readers get to the stalled write slot. (However, it will only be nearly-empty if your system bottlenecks on producer throughput rather than consumers, so stalling consumers for a while may be ok if it doesn't fill up before the stalled write finishes; consumers can catch up. Of course that's bad for latency!) — Peter Cordes, Aug 28 '17 at 03:43
I'm curious: could you implement a queue with equal or better performance characteristics using only traditional locking, with no atomics / fences? (Not even store or load; i.e. no C++11 data race UB without `#include `). Pretty clearly not for the shared `m_write_index` / `m_read_index`, but what if you do that part with atomics and use locking somehow, e.g. actual per-element semaphore? Reader tries to down() the semaphore (so it waits if the counter was zero for an unwritten element), writer up()s the semaphore? But queue-full detection is a problem. Maybe 2 locks per slot? — Peter Cordes, Aug 28 '17 at 03:55
I have made concurrent structures with "unconditional" operations like `lock xadd` in the past, including with "overshoot correction", but unless things have changed on very recent chips it never really pays off: CAS was always about the same cost as the other `lock` instructions, and unless contention was off the charts the retry was more of a theoretical problem than a real one. You can efficiently track retries and it was common too see "contended" structures with billions inserts with a handful of retries. Of course, "undoing overshoot" is just another type of retry! @PeterCordes — BeeOnRope, Aug 28 '17 at 04:00
@PeterCordes - yes, the likelihood of the queue becoming spuriously empty is a function of the queue fullness, the `pop` throughput, and the context switch "off CPU" time. It doesn't need a nearly empty queue though: if the throughput and "off time" are high then you can stall even if the queue started full (so in that respect it does depend a bit on capacity). I'll update it to mention that fullness is a factor. — BeeOnRope, Aug 28 '17 at 04:04
@PeterCordes - well you'd have to define "traditional locking", but some ways that locking can outperform lock-free _in general_ are things like using a scalable rw-lock (one where readers don't all contend for a single shared variable but split their accesses across various cache lines) to lock a single-threaded structure, with lots of read access or (b) cases where the number of atomic ops for the lock-free structure is say > 2 and so the locking version does less total work or more generally when the locking can be amortized across several operations... — BeeOnRope, Aug 28 '17 at 04:15
..., e.g., an iteration which only has to lock once. On x86 it's a bit harder to find example because the "read path" for many lock free structures may not involve any atomics at all so it's harder to beat. On the specific question of the queue, I don't see an obvious way, as the queue is already a pretty crappy structure with "inherent" contention since it orders elements. Fine-grained locking as you suggest sometimes makes structures or operations on structures _possible_ that you couldn't even do with a lock-free approach so you can't always compare apples to apples. — BeeOnRope, Aug 28 '17 at 04:19
Also contrast lock-free and traditional locking with userspace RCU and seqlocks which are both potent techniques when they apply. — BeeOnRope, Aug 30 '17 at 04:18

score 12 · Answer 2 · answered Feb 18 '19 at 21:26

I am the author of liblfds.

The OP is correct in his description of this queue.

It is the single data structure in the library which is not lock-free.

This is described in the documentation for the queue;

http://www.liblfds.org/mediawiki/index.php?title=r7.1.1:Queue_%28bounded,_many_producer,_many_consumer%29#Lock-free_Specific_Behaviour

"It must be understood though that this is not actually a lock-free data structure."

This queue is an implementation of an idea from Dmitry Vyukov (1024cores.net) and I only realised it was not lock-free while I was making the test code work.

By then it was working, so I included it.

I do have some thought to remove it, since it is not lock-free.

score 2 · Answer 3 · answered Sep 18 '17 at 19:41

2

Most of the time people use lock-free when they really mean lockless. lockless means a data-structure or algorithm that does not use locks, but there is no guarantee for forward progress. Also check this question. So the queue in liblfds is lockless, but as BeeOnRope mentioned is not lock-free.

answered Sep 18 '17 at 19:41

Saman Barghi

354
1
8

You are confusing lock-free with wait-free. – Maxim Egorushkin Sep 18 '18 at 09:25
2

@MaximEgorushkin nope, wait-free means forward progress for each individual thread is guaranteed, but lock-free means forward progress for the system as a whole is guaranteed but not for individual threads, [check here](http://www.1024cores.net/home/lock-free-algorithms/introduction). But lockless means there is no guarantee for any kind of forward progress if a single thread is interrupted or exits in the middle of spinning. It requires studying some code to fully grasp the concept. – Saman Barghi Sep 21 '18 at 17:31

Matt Timmermans · Answer 4 · 2017-08-27T17:37:08.533

1

A thread that calls POP before the next update in sequence is complete is NOT "effectively blocked" if the POP call returns FALSE immediately. The thread can go off and do something else. I'd say that this queue qualifies as lock-free.

However, I wouldn't say that it qualifies as a "queue" -- at least not the kind of queue that you could publish as a queue in a library or something -- because it doesn't guarantee a lot of the behaviors that you can normally expect from a queue. In particular, you can PUSH and element and then try and FAIL to POP it, because some other thread is busy pushing an earlier item.

Even so, this queue could still be useful in some lock-free solutions for various problems.

For many applications, however, I would worry about the possibility for consumer threads to be starved while a producer thread is pre-empted. Maybe liblfds does something about that?

edited Aug 27 '17 at 17:37

answered Aug 27 '17 at 17:29

Matt Timmermans

53,709
3
46
87

1

Some lock libraries have a `trylock` function which returns failure if the lock wasn't available, but being able to do something *else* in that thread doesn't mean the queue algorithm is lock-free. That requires that at least one thread in the system can make forward progress *on that algorithm* at any time. https://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom. – Peter Cordes Aug 27 '17 at 18:06
The queue in question is lock-free, but that property doesn't automatically extend to algorithms that use it. In the same way you can make a spin lock from lock-free primitives, you can get threads to block each other using this queue if you want. – Matt Timmermans Aug 27 '17 at 23:35
1

The push and pop algorithms are separately lock-free, but the queue as a whole is not even obstruction-free. A writer that sleeps indefinitely between CAS and release-store eventually leaves both writes and reads blocks (effective queue size of zero), as BeeOnRope explains nicely. At this point it degenerates into a `trylock` situation. I think in practice it will be vanishingly rare for a process to use up its timeslice between CAS and store. You make a good point that a blocked writer can block readers right away if the queue is near-empty. That could be a problem in practice. – Peter Cordes Aug 28 '17 at 00:12

Tsyvarev · Answer 5 · 2017-08-28T08:54:53.930

1

"Lock-free" is a property of the algorithm, which implements some functionality. The property doesn't correlate with a way, how given functionality is used by a program.

When talk about mcmp_queue::enqueue function, which returns FALSE if underlying queue is full, its implementation (given in the question post) is lock-free.

However, implementing mcmp_queue::dequeue in lock-free manner would be difficult. E.g., this pattern is obviously not-lock free, as it spins on the variable changed by other thread:

while(s.sequence_number.load(std::memory_order_acquire) == read_index);
data = s.user_data;
...
return data;

edited Aug 28 '17 at 08:54

answered Aug 27 '17 at 21:09

Tsyvarev

60,011
17
110
153

2

I don't think this a valid assessment of the functionality offered by the queue. If a thread is killed after reserving its slot but before writing it, the queue is essentially broken. It will report that it is both simultaneously full and empty, even to a single thread. So unless the _functionality_ of this queue involves the caveat that it may permanently appear to have a fixed length of zero (making it useless) the queue doesn't offer its functionality in a lock-free way. – BeeOnRope Aug 27 '17 at 23:24
`If a thread is killed after reserving its slot but before writing it, the queue is essentially broken. It will report that it is both simultaneously full and empty, even to a single thread.` - That would mean the algorithm itself is **incorrect**, because it reports *fullness* when it shouldn't (or reports *emptiness* when it shouldn't). No reason to check incorrect algorithm for lock-freedom. – Tsyvarev Aug 28 '17 at 08:46
1

Yes, I agree we're in a grey area here. One could argue however that the the authors are perfectly aware of this behavior, and so the queue is behaving as designed. So it it a lock-free queue? It certainly doesn't operation exactly as you'd expect a lock-free queue to do, but is that because it is not **lock-free** or because it is not a **queue** or because it is **incorrect**? At this point it is mostly semantics, but I think you can still say that a structure that is permanently impacted by indefinite suspension of any thread should not qualify as lock-free. – BeeOnRope Aug 30 '17 at 21:26
Yes, it is actually an *incomplete* question, which doesn't show `dequeue` function. Without it, it is difficult to say anything even about **correctness** of `enqueue`. But, if talk only about `enqueue` and *assuming its correctness*, the function by itself seems to not suffer from "indefinite suspension" when run concurrently with itself: every its iteration takes new `m_write_index`. So, `enqueue` **function itself is lock-free**. But I agree that `dequeue` function *probably* cannot be implemented in lock-free manner, so the whole **data-structure is unlikely lock-free**. – Tsyvarev Aug 30 '17 at 22:37
2

Yes, I think the implementation of dequeue is _implied_ by the `enqueue` code and is fairly obvious (in fact, relevant parts of the `dequeue` operation are described, but only in prose), but even only looking at enqueue you can see in the case of infinite suspension that the `enqueue` method will stop actually adding elements after `m_num_slots` additional elements have been added. So regardless of how many elements have been popped the method will "fail" eventually. Finally, my assumption is that the question is about the structure, despite that only the `enqueue` function was provided. – BeeOnRope Aug 30 '17 at 23:33

Eric · Answer 6 · 2019-04-04T11:33:33.997

I did formal verification on this same code using Spin a couple years ago for a course in concurrency testing and it is definitely not lock-free.

Just because there is no explicit "locking", doesn't mean it's lock-free. When it comes to reasoning about progress conditions, think of it from an individual thread's perspective:

Blocking/locking: if another thread gets descheduled and this can block my progress, then it is blocking.
Lock-free/non-blocking: if I am able to eventually make progress in the absence of contention from other threads, then it is at most lock-free.
If no other thread can block my progress indefinitely, then it is wait-free.