
I need to build a lock-free stack implementation. I have read this page and I understand how the lock-free push operation listed there works.

Now I have to build a similar version of the pop operation. This is what I've done so far, but I think there are some concurrency problems:

template <class T>
bool CASStack<T>::pop(T& ret) {
    node<T>* old_head = head.load(std::memory_order_relaxed);

    if(old_head == nullptr) {
        return false;
    }

    // from here on we can assume that there is an element to pop
    node<T>* new_head;

    do {
        new_head = old_head->next;
    } while(!head.compare_exchange_weak(old_head, new_head, std::memory_order_acquire, std::memory_order_relaxed));

    ret = old_head->data;

    return true;
}

I think I'll also get into trouble if I delete `old_head` after the swap, right?

EDIT: Updated question!

lukasl1991
  • Yes, deletion is hard because other threads could still be reading it. RCU has the same problem, and solves it by deferring deletion until all threads have passed a sync point. (IIRC, with a generation number or something, so you know which pool of old objects can be deleted now). Wikipedia has a diagram. https://en.wikipedia.org/wiki/Read-copy-update#Overview – Peter Cordes Dec 11 '17 at 15:10
  • You should put `new_head = old_head->next;` inside the CAS retry loop. Think about which variable `compare_exchange_weak` updates on failure. – Peter Cordes Dec 11 '17 at 15:15
  • Yes, you're right! I'll insert this into the loop. What if we ignore the deletes for a moment: Is it allowed to do old_head->next within compare_exchange_weak? Is this still atomic? – lukasl1991 Dec 11 '17 at 16:01
  • What do you mean "within"? If you mean inside a `do { new_head = old_head->next; } while(! CAS)` loop, then yes, that's what you should do. Any `compare_exchange_weak` that compiles is itself atomic, the only issue is whether the pointer you're trying to replace `head` with is pointing to an object with the right contents. (And getting the rest of the logic right.) It's possible to go wrong if your algorithm actually requires (for correctness) multiple updates (to different objects) to be an atomic transaction; i.e. if you designed it wrong. But that's not the case here. – Peter Cordes Dec 11 '17 at 16:03
  • I thought about `while(!head.compare_exchange_weak(old_head, old_head->next, std::memory_order_release, std::memory_order_relaxed)) { new_head = old_head->next; }` – lukasl1991 Dec 11 '17 at 16:07
  • Use a `do{}while()` structure so you don't also need a separate `new_head = old_head->next;` outside the loop. Any time you have a loop that will always run at least once, `do{}while()` is a good choice. – Peter Cordes Dec 11 '17 at 16:08
  • Why should the pointer I'm trying to replace head with not point to an object with the right contents? Only if `old_head->next` is a null pointer?! I don't understand what you mean with multiple updates!? – lukasl1991 Dec 12 '17 at 07:43
  • In the last sentence of [that comment](https://stackoverflow.com/questions/47755527/lockfree-stack-with-atomic?noredirect=1#comment82474562_47755527), I was talking about a totally different case. e.g. a doubly linked list with head and tail pointers. Removing the last element would have to set them both to NULL at the same time, but you can't do that with two separate CAS operations. So even though each CAS itself is atomic, the algorithm is broken (and needs to be more complex, or cooperate with readers on some ordering) because it depends on atomicity for a group of updates. – Peter Cordes Dec 12 '17 at 07:55
  • After looking at your code again, I realized what it's actually doing. The comments about `new_head` are highly misleading; I didn't read the giant CAS line carefully enough to notice it wasn't doing what I expected based on the comments and other code. – Peter Cordes Dec 12 '17 at 08:28
  • You need (at least) memory order `acquire` for the pop operation, otherwise objects passed from one thread to another through the stack won't be safely published (i.e., the receiving thread may see the object in a partially constructed, or otherwise partially modified, state). Release is only appropriate for `push`. – BeeOnRope Dec 13 '17 at 00:21
  • Ah I've read about [that issue](https://stackoverflow.com/questions/47755527/lockfree-stack-with-atomic#comment82496388_47755527) with the doubly linked list. In this case, one could use the technique called "pointer marking", right? And I fixed [this](https://stackoverflow.com/questions/47755527/lockfree-stack-with-atomic#comment82497302_47755527), so now I have `new_head` as the second argument of `compare_exchange_weak` – lukasl1991 Dec 13 '17 at 12:58
  • @BeeOnRope: I noticed that, too. See the bottom section of my answer. – Peter Cordes Dec 13 '17 at 14:19
  • @BeeOnRope: OP's comment made me realize we need `consume` (or `acquire`) not just on the CAS, but also on the first load, because we need to deref it to get `old_head->next`. I say "technically" because everything except Alpha gives us `consume` dependency-ordering for free when the compiler can't find a non-data-dependent way to get the next address, but multi-threaded code is hard enough without trying to outsmart or lie to the compiler. Anyway, updated my answer. – Peter Cordes Dec 14 '17 at 09:42
  • @PeterCordes - so you got me to look at the code again. It seems that the `pop` call _doesn't return the popped element_. Yes, that's the "classic SRP" design that C++ loves, but it makes little sense in an atomic stack since you'll never know what element you popped. So my comment above isn't really valid: since this method isn't returning the element, you don't really need an `acquire` barrier to protect the callers. You _do_ need the acquire order on the initial load to make `old_head->next` safe, as you point out! After that, you can use relaxed for the CAS AFAICT... – BeeOnRope Dec 14 '17 at 18:32
  • I don't think the CAS needs acquire, because all it needs is a consistent modification order on `head`, which is already guaranteed by `atomic`. The acquire/release semantics for internal use (i.e., to access `->next`) just flow through the load. The acquire/release semantics for the use of the API are hard to determine since this code doesn't accept/return nodes. @lukasl1991 - you probably want to return the popped node. It isn't safe to check `top` and then use `pop` to get rid of that element, since the stack may change in the meantime. Usually `pop` returns the popped object in concurrent designs. – BeeOnRope Dec 14 '17 at 18:36
  • @BeeOnRope, yes I understand. I pass a reference to pop and want to return the popped value through this reference. Boost and other libraries also do so and return a boolean only – lukasl1991 Dec 14 '17 at 19:22
  • @lukasl1991 - why doesn't that appear in the above code? – BeeOnRope Dec 14 '17 at 19:23
  • I'm sorry. Thought it would be enough to post the function body. But my paste is complete: https://pastebin.com/L34h9Xsk – lukasl1991 Dec 14 '17 at 20:39
  • @BeeOnRope: The version in my answer returns a pointer to the popped element (or `nullptr`), because I noticed that missing in the question. I guess that's bogus if it needs to delay `delete`ing it, though. – Peter Cordes Dec 14 '17 at 21:10
  • @lukasl1991 - you didn't post the function body: the critical line `ret = old_head->data;` doesn't appear at all! How you return the object is a critical part of the implementation and thread-safety. Even that line isn't enough, because you'd certainly need to see the declaration of `ret` to understand what's going on. Here you make a copy, which is a nice way to side-step the memory release issues, but it also has several other performance and semantic implications. BTW, you probably want `ret = std::move(old_head->data)` for performance. – BeeOnRope Dec 14 '17 at 21:21
  • And how can I avoid the null pointer in my loop now? – lukasl1991 Dec 15 '17 at 06:32
  • @lukasl1991 - I'm not sure, but if you have a separate question, you should probably create a separate Q&A on it. The comment thread isn't really the place to ask brand-new questions. – BeeOnRope Dec 17 '17 at 06:29
  • Essentially, my question was how to implement a lock-free pop operation. Therefore I think the null-pointer issue belongs in this question. Nevertheless, I'm going to post my current solution as an answer for better overview. – lukasl1991 Dec 17 '17 at 10:37

5 Answers


Your `node<T>* new_head = old_head->next;` is a red herring; you never use this variable.

In my comments suggesting you needed to put it inside a `do{}while(!CAS)` loop, I was thinking you were doing `head.CAS(old_head, new_head)`. That would have the problems I was talking about, of putting a possibly-stale pointer into the list if the CAS did have to retry.

But you're actually doing `head.CAS(old_head, old_head->next)`, which generates the "desired" value from the updated `old_head` every time through the loop. This is actually correct, but hard to follow, so I'd suggest using a `do{}while()` like so:

// FIXME: this may suffer from ABA problems; see other answers.
node<T>* pop(std::atomic<node<T>*> &head)
{
    // We technically need acquire (or consume) loads of head because we dereference it.
    node<T>* old_head = head.load(std::memory_order_acquire);

    node<T>* new_head;
    do {
        if(old_head == nullptr) {
            // need to re-check because every retry reloads old_head:
            // pop in another thread might have emptied the list
            return nullptr;
        }

        new_head = old_head->next;
        // if head still equals old_head this implies the same relation for new_head
    } while(!head.compare_exchange_weak(old_head, new_head,
                                        std::memory_order_acquire));
    // Note the ordering change: acquire for both success and failure

    return old_head;  // defer deletion until some later time
}

(To deal with possible ABA problems, a pointer+sequence-number struct may be needed. Doing that in a way that still allows efficient loads of just the pointer can be pretty hacky: How can I implement ABA counter with c++11 CAS?. See also other answers on this question which address the ABA problem; I wrote this answer a while ago and don't guarantee that the whole thing adds up to a usable lock-free stack!)

Is it allowed to do old_head->next within compare_exchange_weak? Is this still atomic?

The CAS is still atomic. Any compare_exchange_weak that compiles is itself atomic. The compiler evaluates the args before the function call, though, so reading old_head->next isn't part of the atomic transaction that CAS does. It's already been read separately into a temporary. (Doing this explicitly with a separate variable like in the do{}while loop is common.)

If `node::next` is an `atomic<>` member of `node`, you should think about what memory order you want to use for that load. But for a pure stack, it doesn't have to be atomic, because linked-list nodes are never modified while they're on the stack, only before being pushed with the right `next` pointer. Shared read-only access is not a race.
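A sketch of the push side under that reasoning (my own minimal framing, not the OP's exact class): `next` is a plain, non-atomic pointer because the node is fully written before the release-CAS publishes it, and nodes are never modified while they sit on the stack.

```cpp
#include <atomic>
#include <utility>

template <class T>
struct PNode {
    T data;
    PNode* next;   // plain pointer: read-only while the node is published
};

template <class T>
class PushOnlyStack {
public:
    std::atomic<PNode<T>*> head{nullptr};

    void push(T value) {
        // Fully construct the node before it becomes reachable.
        PNode<T>* n = new PNode<T>{std::move(value), nullptr};
        n->next = head.load(std::memory_order_relaxed);
        // Release pairs with the acquire loads in pop: any thread whose
        // acquire-load of head sees n also sees the stores to *n.
        // On failure, compare_exchange_weak reloads head into n->next,
        // re-linking the node for the next attempt.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }
};
```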


Usage as a pure stack also reduces deletion problems: threads can't "peek" at the head node or traverse the list. They can only look inside a node after popping it, and the pop algorithm ensures they have exclusive ownership of the node (and are responsible for deleting it).

But pop() itself needs to load from the head node. If another thread races with us and returns the memory for that head to the OS, we could fault. So we do have a deletion problem like RCU does, like I mentioned in a comment.

Simply reusing the memory for something else wouldn't be a problem on most C++ implementations, though: we would read a garbage value for old_head->next, but the CAS would fail (because the head pointer must have changed before the old head object was freed), so we'd never do anything with the bogus value we loaded. It's still C++ UB for our atomic load to race with a non-atomic store, but a compiler would have to prove that this race actually happens before it's allowed to emit anything other than normal asm, and no mainstream CPU has any problem with such a race in asm.

But unless you can guarantee that free() or delete just put the memory on a free list, i.e. that they don't munmap it between a load of head and a deref of old_head->next, the above reasoning doesn't make it safe for the caller to delete pop's return value right away. It only means problems are very unlikely (and hard to detect with simple testing).


Memory ordering

We load head and then expect that pointer to point to useful values. (i.e. old_head->next). This is exactly what memory_order_consume gives us. But it's hard to use, and so hard to optimize that compilers just strengthen it to acquire, which makes it impossible to test code that uses consume. So we really want acquire for all our loads of head.

(Using consume with current compilers is equivalent to acquire. If you actually need the performance of data dependency ordering with no barriers, see C++11: the difference between memory_order_relaxed and memory_order_consume for how to try to safely use relaxed.)

Note that getting the value out of the node we pop also depends on memory ordering, but I think if we didn't need old_head->next we could use relaxed everywhere but in the success side of the CAS (where we would need at least consume, so in practice acquire).

(On mainstream C++ implementations we could probably get away with relaxed on all architectures except DEC Alpha AXP, the famously weakly-ordered RISC from the 90s. The compiler will almost certainly create code with data dependencies on the loaded pointer, because it doesn't have any other way to access the values it needs. And all "normal" hardware except Alpha provides the mo_consume style dependency ordering for free. So testing with relaxed would never show problems unless you had one of the rare models of Alpha that actually could produce this reordering in hardware, and a working C++11 implementation for it. But it's still "wrong", and could potentially break with compile-time reordering, or maybe I'm missing something and relaxed might actually break in practice without inlining into something more complex + constant-propagation.)

Note that these mo_acquire loads synchronize-with the mo_release store in the thread that pushed the object pointed to by the current head. This prevents our non-atomic loads from old_head from racing with non-atomic stores to the node in the thread that pushed it.

Peter Cordes
  • My code looks now like [this](https://pastebin.com/L34h9Xsk). The program crashes with received signal SIGSEGV, Segmentation fault when dereferencing the pointer `new_head = old_head->next;` What's the problem? – lukasl1991 Dec 14 '17 at 08:53
  • @lukasl1991: oh, we forgot that the `nullptr` check is needed every time `old_head` is updated. Derp. (I assume that's the problem if your code looks like the code in my answer.) – Peter Cordes Dec 14 '17 at 08:59
  • My code looks similar. The only difference is the signature `template <class T> bool CASStack<T>::pop(T& ret) {` But `if(old_head != nullptr)` would not solve the problem, in my opinion. Do I need an additional CAS there? – lukasl1991 Dec 14 '17 at 14:53
  • +1 on "use acquire not consume". That's my policy: use acquire, it is simple to reason about [1], well supported, doesn't require you and your compiler to understand this fairly crazy "carries dependency" junk. In the very rare case you hit the combination of platform and use-case where this matters, carefully optimize just the few places it matters to use consume and test carefully. [1] Relatively speaking, anyways! – BeeOnRope Dec 14 '17 at 18:41
  • @BeeOnRope: I think you literally can't get full `consume` performance from current compilers unless you use `relaxed` and cross your fingers, because they literally treat it as a synonym for `acquire` until anyone figures out how to correctly implement the carries-dependency stuff. If `relaxed` doesn't give you correct asm (data-dependent loads) for your target platform, I guess you'd have to write asm by hand? – Peter Cordes Dec 14 '17 at 21:14
  • @PeterCordes - wasn't it IBM pushing for consume due to POWER? You'd think they at least submit some compiler patches to use this now that they browbeat the committee into accepting it. – BeeOnRope Dec 14 '17 at 21:23
  • @BeeOnRope: [Herb Sutter's mentioned this in a talk](https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/), but he only mentioned them pushing for having anything weaker than seq_cst as an option. (Which I thought was insane; surely release/acquire aren't so crazy. Yes they're harder, but without weaker orderings kernel developers would have considered C++11 a total joke, rather than just being a maybe-worse way to get what they already have with GNU C inline asm. (I think that's more or less what Linus Torvalds has said about using C11 in Linux)). – Peter Cordes Dec 14 '17 at 21:30
  • @PeterCordes: I am not sure if I understood memory ordering correctly. As you suggested, my pop algorithm uses acquire now. But the memory orders in the [push algorithm](http://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange) `head.load(std::memory_order_relaxed)` and `head.compare_exchange_weak(new_node->next, new_node, std::memory_order_release, std::memory_order_relaxed)` are still correct? – lukasl1991 Dec 19 '17 at 09:19
  • @lukasl1991: yeah, that looks right. Remember that `push` is releasing its private object to other threads, but `pop` needs acquire the correct data for the object pointed to by `head` even before the CAS succeeds. So it makes sense that the requirements are different. (And in this case, the English language meaning of "acquire" and "release" aligns with the technical meaning of `mo_acquire` and `mo_release`.) – Peter Cordes Dec 19 '17 at 14:38
  • Dear @PeterCordes, I am deeply confused about one aspect of your example: the load of `old_head->next` is a dependent load, i.e. its value must be observed before the load of `next` can even be issued. For this reason, it seems to me that the *acquire* ordering both when loading `old_head`, and in the failure argument of `compare_exchange_weak`, are excessive, and `relaxed` should be used. Could you clarify? Thank you in advance! – Wenzel Jakob Nov 24 '20 at 22:54
  • @WenzelJakob: In ISO C++, you need `mo_consume` to get dependency ordering, not `mo_relaxed`. (i.e. the C++ memory model is weak enough to support DEC Alpha, which would need a barrier for this). But `mo_consume` is deprecated, and promoting to `mo_acquire` is recommended until the C++ committee designs a `consume` that compilers can actually implement. See links in [\[\[carries\_dependency\]\] what it means and how to implement](https://stackoverflow.com/q/64113244) – Peter Cordes Nov 24 '20 at 23:54
  • @PeterCordes -- thank you for clarifying. Your point regarding such extremely weakly ordered architectures makes a lot of sense from a rigorous standard-compliance point of view. But given that DEC Alpha hasn't been relevant for decades, my interest in getting these ordering specifications right is mainly with regards to ARM/AArch64. Would you say that my point above regarding replacing `acquire` with `relaxed` is valid here? (I'm thinking it can be used in both the `.load()` and both success/failure arguments of `.compare_exchange_weak`) Thank you in advance! – Wenzel Jakob Nov 25 '20 at 17:22
  • @WenzelJakob: If you *need* the performance of `mo_consume` with current compilers, your only option is to use `relaxed` *and be careful with it*, i.e. make sure the compiler can't easily optimize to branch on the value instead of a data dependency. (Because control dependencies don't "carry a dependency" - that's the danger on non-Alpha CPUs). See [C++11: the difference between memory\_order\_relaxed and memory\_order\_consume](//stackoverflow.com/a/59832012) which directly answers replacing consume with relaxed, with a link to a good video of a talk about doing this for Linux RCU, etc. – Peter Cordes Nov 26 '20 at 03:54
  • @WenzelJakob: note that AArch64 has relatively cheap `acquire` with LDAR, which is very cheap if you don't also have release-stores soon after. (ldar can't reorder past stlr, so you get seq_cst instead of acq_rel. So an acquire and then an unrelated release-store would still drain the store buffer whether you like it or not.) – Peter Cordes Nov 26 '20 at 03:57

The answer @PeterCordes gave was very instructive, but it didn't address the whole problem.

I'm writing my own answer because I too had to implement a lock-free stack and failed on the reentrancy tests for the pop operation.

The implementation given by Mr. Cordes didn't account for the underlying ABA problem.

Understanding the reentrancy issue: when you attempt to pop the head of the stack, the CAS (compare_and_exchange) operation only proceeds if the head of the stack "is the same".

Being "the same" is the key here: as far as the CAS instruction is concerned, being the same means the pointer is the same, but the data is not necessarily so -- what if, in the meantime, the stack suffers a proper pop from a second thread? ... and, after that, another (third) thread pushes a new element, which happens to be stored at the same address the head seen by thread 1 had?

In this case, the CAS instruction in thread #1 would succeed, but using a `->next` pointer which is no longer valid.

The proper way of avoiding this ABA problem seems to be storing an *atomic head* structure consisting of both the head pointer and the next pointer.

The proposed solution is implemented here -- MTL's UnorderedArrayBasedReentrantStack

The reentrancy tests are here -- https://github.com/zertyz/MTL/blob/master/tests/cpp/UnorderedArrayBasedReentrantStackSpikes.cpp

Tested on x86_64 and ARM 32 & 64 bits.

Hope this helps someone.

zertyz

Imagine that between loading `old_head` and dereferencing `old_head->next`, the CPU was diverted by an interrupt and didn't get back to this sequence for a very long time (days, weeks, etc.). Meanwhile, some other thread has popped `old_head` from your stack, processed it, and returned it to a heap, possibly repurposing it for another object.

The reason it works for `push` is that the pushing code owns the object to be pushed. That isn't true of `pop` -- pop is discovering the object and then attempting to gain ownership of it. To work lock-free you must be able to perform both operations simultaneously, which makes linked lists difficult, if not unusable.

By comparison, with an array you know that `next` is `top - 1`, so:

do {
    x = stack[temp = top];
} while (cswap(&top, temp, temp - 1) != temp);

is temptingly close. The rub is that you need to encode a generation count into `top`, so that every assignment to `top` is unique:

struct uuidx { int index; very_large_int sequence; };
extern (volatile, atomic, whatever) struct uuidx top;

...
struct uuidx temp, next;
do {
    x = stack[(temp = top).index];
    next = (struct uuidx){ .index = temp.index - 1,
                           .sequence = temp.sequence + 1 };
} while (cswap(&top, temp, next) != temp);
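The pseudocode above could be fleshed out roughly like this in C++ (class and member names are mine, not from the answer): index and sequence are packed into one 8-byte struct, so an ordinary lock-free CAS suffices on 64-bit targets. The push here is single-producer only, to keep the sketch short.

```cpp
#include <atomic>
#include <cstdint>

template <class T, int N>
class ArrayStack {
    struct Top {
        std::int32_t index;      // index of current top element; -1 == empty
        std::uint32_t sequence;  // bumped on every pop so old Top values never recur
    };
    T slots[N];
    std::atomic<Top> top_{Top{-1, 0}};

public:
    // Single-producer push for the demo; a concurrent push would need
    // its own CAS loop (and care about slot reuse).
    void push(const T& v) {
        Top t = top_.load(std::memory_order_relaxed);
        slots[t.index + 1] = v;
        top_.store(Top{t.index + 1, t.sequence}, std::memory_order_release);
    }

    bool pop(T& out) {
        Top old_t = top_.load(std::memory_order_acquire);
        Top new_t;
        do {
            if (old_t.index < 0)
                return false;                 // empty (rechecked on retry)
            out = slots[old_t.index];         // re-read if the CAS retries
            new_t = Top{old_t.index - 1, old_t.sequence + 1};
        } while (!top_.compare_exchange_weak(old_t, new_t,
                                             std::memory_order_acquire));
        return true;
    }
};
```

Because the sequence counter makes every `Top` value unique, a pop that slept between the load and the CAS can't succeed against a recycled index.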
mevets
  • It's not clear from the code the OP posted, since it's an apparently incorrectly simplified version of his real code, but he doesn't actually return a _reference_ to the popped object, but rather copies the popped object via the assignment operator into an object passed by the caller. So there is no memory freeing problem here as a copy is made to avoid that issue: the internal objects are not exposed to the caller. The OP provided his actual code [here](https://pastebin.com/L34h9Xsk). Why he didn't fix the question to make that clear I don't know. – BeeOnRope Dec 18 '17 at 00:32
  • I updated the question to avoid more confusion. @mevets I think my code is not vulnerable according to the issue "Imagine that between loading old_head and dereferencing old_head->next..." because when the other thread popped `old_head` away, I can still dereference the pointer `old_head->next` even if it points to another object now. But if this is the case, `old_head` does not equal `head` anymore, so the CAS fails and overwrites `old_head` with the current value of `head`. – lukasl1991 Dec 18 '17 at 13:46
  • How do you know you can still dereference it? Is there a hidden garbage collector? If not, it could have been freed to the heap. Since it was released, the region (page) it resides in could have been unmapped, so old_head->next will result in a SIGSEGV. GC changes everything, but not in a good way... – mevets Dec 18 '17 at 16:33
  • The missing `delete` is another problem in my algorithm... :-/ – lukasl1991 Dec 19 '17 at 08:44

This is my solution so far:

template <class T>
bool CASStack<T>::pop(T& ret) {
    node<T>* new_head;

    // get the current head
    node<T>* old_head = head.load(std::memory_order_relaxed);

    do {
        // it is a null pointer iff our stack is empty
        if(old_head == nullptr) {
            return false;
        }

        // otherwise, we can dereference it and access its next node
        new_head = old_head->next;
    } while(!head.compare_exchange_weak(old_head, new_head, std::memory_order_acquire, std::memory_order_relaxed));

    // finally write the popped value into ret
    ret = old_head->data;
    return true;
}

I would highly appreciate your assessment. I know about two issues with this code:

1) If another thread pushes an element between head.load and the nullptr comparison, my algorithm does not pop it. I don't have any idea how to fix this.

2) Within the push operation, elements are created with `new`. My code crashes if I add a `delete old_head;` before `return true;`. So I know that this algorithm has a memory leak. Could I apply this solution?

lukasl1991
  • *If another thread pushes an element between head.load and the nullptr comparison*, then it happened after this thread attempted to pop. Normally you won't be racing with other threads, so design your code to be efficient in the no-race case and correct in all cases. (And not too slow when there is contention, of course). Depending on the use-case, you might want to spin a few times before returning false, but that definitely isn't better in general. In most use-cases (other than your stress test) if the stack is empty when you check, it will still be empty a few thousand cycles later. – Peter Cordes Dec 17 '17 at 23:36
  • See my answer for why all the loads need to be `acquire`, including the `fail` case of CAS, and why testing won't reveal the potential problem that could bite if compiled on a different platform. That's why I used the version that only has one ordering arg. My last update to my answer also explains why you crash if `delete` unmaps the page while other threads are still dereferencing the old head. I think you know that because you're talking about using `shared_ptr`, but interesting that you actually see it in practice; I figured the node would usually just end up on a free list. – Peter Cordes Dec 17 '17 at 23:42
  • Yes using `shared_ptr` somehow could solve that, but it's somewhat expensive: every pop attempt has to atomically increment the `shared_ptr` ref counter, and then decrement it after successfully popping. Then in the rare case where a thread trying to pop holds the last reference, it will `delete` the node. RCU-style deletion schemes (like a generation counter that tracks when all threads have passed it) would reduce that overhead, but you'd have to keep track of the objects to delete later. That requires cooperation from the calling thread outside of this function, though. – Peter Cordes Dec 17 '17 at 23:44
  • So if you want to keep it simple, one option would be to allocate your nodes from a pool that you manage yourself, so they're always safe to read even after you "free" them. Hmm, if you recycle nodes right away, you could have an ABA problem where CAS succeeds because `head` is the same, even though it's actually pointing to an object with different contents. I wonder if that's why you actually crash. – Peter Cordes Dec 17 '17 at 23:50
  • @PeterCordes `shared_ptr` objects are not thread-safe, just like almost any other standard library object. You can have _different_ `shared_ptr` objects which happen to refer to the same object, and the reference count will be manipulated in a thread-safe way (just like different threads assigning _different_ dumb pointers to objects that happen to be shared), but `shared_ptr` objects themselves are not suitable for sharing, except in the "usual" way standard library objects are (only if all threads are calling `const` methods) - maybe that's enough though? – BeeOnRope Dec 18 '17 at 00:50
  • @BeeOnRope: When `head.compare_exchange_weak(old_head, ...` fails, it loads `old_head = head`. [It's a reference arg](http://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange). – Peter Cordes Dec 18 '17 at 00:59
  • Ah, oops! Yeah it's weird that it crashes with `delete`, perhaps there is an ABA problem if the node gets handed back out to another thread (that's common for free-lists: they like to hand out the most recently freed node since it's the hottest). – BeeOnRope Dec 18 '17 at 01:04
  • @BeeOnRope: Oh right, I forgot exactly how `shared_ptr` worked. The ref count is in the separate control-block that the `shared_ptr` object itself has a pointer to. Maybe if `head` was an `atomic<shared_ptr<Node>>`? There's a chicken/egg problem here which I didn't think through before: the ref counts can't just be in the `Node` objects, because you can't increment the ref count before dereferencing `head`. I don't think you can avoid problems by not decrementing until after you modify `head` either. It's not obvious where to put a refcount that isn't just a mutex for `head` in disguise. – Peter Cordes Dec 18 '17 at 01:09
  • I don't think `atomic<shared_ptr<Node>>` is allowed, because `T` for `atomic<T>` must be trivially copyable, and `shared_ptr` isn't. One approach to the problem you mention is _double_ reference counting. You store something like a `{count, Node*}` pair as `head`, where `count` is a reference count and `Node` itself has _another_ count. Every thread that wants to dereference `head` first increments the `count` embedded in `head` using CAS of the entire pair, then it can go ahead and access the `Node`. When it's done, it increments the reference count inside the `Node` (nothing gets decremented here). – BeeOnRope Dec 18 '17 at 01:15
  • So the effective reference count is `Node.count - head.count`, and when that's zero the holding thread can delete `Node`. You need cmpxchg16b to do this simply, although you can probably "pack" this into 64 bits if you were motivated. A reasonable solution though is probably just to use a dedicated allocator that keeps recycling nodes so they are always "safe" to read, and a generation counter to avoid ABA, which you can pack together into a 64-bit value as `head`. The packing is easier here since you don't need full pointers: just a node index, which could be only as many bits as you need. – BeeOnRope Dec 18 '17 at 01:18
  • @PeterCordes: [That's](https://stackoverflow.com/questions/47755527/lockfree-stack-with-atomic#comment82685773_47854310) a good argument. So my first concern is blown away. No, I don't want to spin a few times before returning false. Furthermore, I did not notice the update of your answer. I changed the memory order to acquire now. Do I have to use acquire also for the [push example](http://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange)? – lukasl1991 Dec 18 '17 at 14:31
  • @PeterCordes: I actually thought about managing a free list but I need to synchronize operations on this free list as well... Maybe it would be easier returning a `Node*` instead of a boolean? – lukasl1991 Dec 18 '17 at 14:34
  • I think this discussion of yours exceeds my horizon... :-/ What's the consensus now? Is there a "simple" way of freeing memory? – lukasl1991 Dec 18 '17 at 14:44
  • @lukasl1991: Well, RCU-style deletion would have the caller involved in taking part in some infrequent all-threads-are-past-this-point synchronization so you could delete old nodes. Without that, no, you definitely can't return a `Node*`. Having the caller `delete` it is no better than deleting it yourself. – Peter Cordes Dec 18 '17 at 14:50
  • @BeeOnRope (and Lukas): I was thinking about the whole idea of this yesterday: most of these problems go away if there's only a single consumer thread running `pop`. Probably the author of the `push` example on CPPreference didn't have an entire data structure in mind, just atomic `push`, or they were thinking single-writer / single-reader. (Multi-writer / single reader works, too.) But when do you ever want a stack instead of a queue, especially with multiple readers/writers? A stack has more contention between writers and readers. (vs. none for a single-writer / single-reader queue) – Peter Cordes Dec 18 '17 at 14:57
  • See https://stackoverflow.com/questions/45907210/lock-free-progress-guarantees for an analysis of how a lockless multi-writer / multi-reader queue works. (Using a fixed-size circular buffer, not dynamically-allocated nodes: this is often much more sensible) – Peter Cordes Dec 18 '17 at 14:59
  • 1
    @PeterCordes - well if it weren't for the memory management problem, node-based structures would be fairly straightforward and would be the bread and butter of lock-free structures, as they are in academia (which often simply ignores the problem or says "use hazard pointers" or something similar without evaluating the costs) and in garbage collected languages (you also see hybrid structures there like `ConcurrentHashMap` which does have an array as a key component). You really saw an explosion of concurrent structures and use in Java for example... – BeeOnRope Dec 18 '17 at 17:14
  • 1
    ... but this memory management complication is really impeding their uptake in native code. In fact, you can now see expert recommendations in the area like: "use lock-free structures only if you need lock-free semantics such as no-deadlock, safety in signal handlers - but not for performance because they are usually slower". That's the opposite of good implementations in GC'd languages where they are often an order of magnitude faster than the lock-based approaches and are almost always used because of better performance and scalability. – BeeOnRope Dec 18 '17 at 17:16
  • As for why you'd use a stack? Well I guess it depends on the application requirements, but certainly queues are far more common as it is usual in a producer-consumer scenario to consume things in a FIFO order (e.g., requests, work). You could certainly imagine an external requirement that imposed a LIFO order at which point you might use a stack. I could also imagine it being better for performance when work is being passed between threads: processing the most recently ready work might find a lot more of the relevant lines in the cache, for example. – BeeOnRope Dec 18 '17 at 17:19
  • Finally, restricting either the number of consumers or producers to 1 (it isn't totally clear what you mean by "writer" and "reader" in the context of a stack - let's say a producer pushes and a consumer pops) definitely simplifies the implementation. In high quality libraries (at least in GC'd languages where much of the advances happen) you often find all 4 specializations: MPMC (multiple producer & consumer, fully general), MPSC, SPMC, SPSC, since each specialization allows better performance. – BeeOnRope Dec 18 '17 at 17:27
0

You cannot implement a lock-free stack if you expect multiple threads doing both push and pop operations.

In that scenario, modifying the stack pointer and accessing the data on the stack (for read or write) must happen atomically together. If they don't, there are two possible orderings for a push's memory operations, both of which are inconsistent:

  • stack pointer increment happens before writing the data to the stack: in this case a concurrent pop operation can be scheduled in the middle and read garbage data from the stack.
  • stack pointer increment happens after writing the data to the stack: in this case a concurrent push may overwrite the data we just wrote before the pointer is incremented.

Similarly, if concurrent pop operations are possible, reading the data and modifying the stack pointer can be interleaved with other operations in ways that produce an invalid state:

  • if pop reads the data after decrementing the pointer, a concurrent push may overwrite the data before it is read.
  • if pop reads the data before decrementing the pointer, a concurrent pop followed by a push may result in the old data being read twice and the new data being lost.

A (partially) lockless implementation is possible if only one thread is pushing, and only one is popping data:

  • push can read the stack pointer with acquire semantics, safely write above the current top (no other thread is using that region of memory), and then atomically update the stack pointer with release semantics to signal to the pop thread that the data below the pointer is valid.
  • pop can read the stack-pointer value with acquire semantics, read the data at the top, and then compare-exchange the pointer down by one with release semantics, signalling the other thread that the memory used by the popped value is available again for new values. A compare-exchange failure means more values have been pushed, so pop retries with the new top of the stack.

In code:

void push(T value) {
    auto stp = stack_pointer.load(std::memory_order_acquire);
    stack[stp + 1] = value;
    // Release store: the write to stack[stp + 1] must become visible
    // before the new index is published.
    stack_pointer.store(stp + 1, std::memory_order_release);
}

T pop() {
    auto stp = stack_pointer.load(std::memory_order_acquire);
    while (true) {
        auto value = stack[stp];
        // On failure, compare_exchange_weak reloads stp with the
        // current value of stack_pointer, so the next read is retried
        // at the new top.
        if (stack_pointer.compare_exchange_weak(stp, stp - 1,
                std::memory_order_release,
                std::memory_order_acquire)) {
            return value;
        }
    }
}

This implementation is "partially lockless" because the pop implementation must spin if concurrent pushes happen, which is a kind of lock.
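As a compilable sketch of the scheme above, assuming a single producer and a single consumer as described (the fixed capacity, the `SpinStack` name, and reserving slot 0 as an "empty" sentinel are assumptions added for illustration):

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

template <class T, std::size_t N>
class SpinStack {
    T slots_[N + 1];                    // slots_[0] is the unused sentinel
    std::atomic<std::size_t> top_{0};   // index of the current top element

public:
    // Single producer: no other thread writes above top_.
    bool push(T value) {
        std::size_t stp = top_.load(std::memory_order_acquire);
        if (stp == N) return false;     // full
        slots_[stp + 1] = value;
        // Publish: the write to slots_[stp + 1] is visible before the index.
        top_.store(stp + 1, std::memory_order_release);
        return true;
    }

    // Single consumer: CAS detects a concurrent push and retries.
    std::optional<T> pop() {
        std::size_t stp = top_.load(std::memory_order_acquire);
        while (true) {
            if (stp == 0) return std::nullopt;  // empty
            T value = slots_[stp];
            // On failure, compare_exchange_weak reloads stp.
            if (top_.compare_exchange_weak(stp, stp - 1,
                    std::memory_order_release,
                    std::memory_order_acquire)) {
                return value;
            }
        }
    }
};
```

With multiple consumers this sketch is unsafe, as the comments on this answer discuss: two poppers could claim the same slot, or a slot could be overwritten while a popper is still reading it.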

pqnet
  • 6,070
  • 1
  • 30
  • 51
  • 1
    Notice that the code in the question is using a linked list. That makes it possible to have a node fully set up before "publishing" it by storing a pointer to it anywhere that other threads can see. And one reader can atomically claim a node, and take responsibility for reading + eventually freeing it. If you did want to use an array (like one would for a mostly-lock-free queue), yes you'd need to reserve a slot and then write it, perhaps using some kind of sequence number so readers can tell when a slot is fully written. – Peter Cordes Jun 05 '21 at 02:17
  • But that would suck a lot compared to a queue because the next read is always of the node you've only just barely constructed, or are still constructing. (opposite of a queue, where it's lock-free except when full or empty. [Lock-free Progress Guarantees](https://stackoverflow.com/q/45907210) is a good analysis of an MPMC queue.) Perhaps some way to let a reader ignore newly-reserved slots until the writer has finished something with a release-store? But then the used part of the storage isn't necessarily contiguous, so you need more bookkeeping which may end up being linked-list-like. – Peter Cordes Jun 05 '21 at 02:19
  • @PeterCordes You are right. I misunderstood the question, interpreting `stack` as a contiguous memory region. It still looks to me like a lock-free multiple-consumer `pop` operation would be impossible even in the list case: in order to update the `head` pointer with the `next` pointer you need to read the object pointed to by `head`, and you can't prevent another concurrent `pop` from removing and freeing it unless you lock it. Multiple producers are OK because reading the `head` pointer is always allowed. – pqnet Jun 08 '21 at 16:35
  • Yeah, the deallocation problem is widely recognized as a serious problem for concurrent data structures like this in non-garbage-collected environments. Solutions include (IIRC) never actually unmapping, but only placing on a free list for reuse in this queue so it's safe to read without faulting, and you'll see a different sequence number. Or maybe something like collective progress barriers like RCU uses, so threads can know there are no readers older than time X, and free those nodes. – Peter Cordes Jun 08 '21 at 21:53