0

The below class describes a lock free stack of uint32_t sequential values (full code here). For instance, LockFreeIndexStack stack(5); declares a stack containing the numbers {0, 1, 2, 3, 4}. This class has pool semantic. The capacity of the stack is fixed. Only the values originally introduced in the stack can be extracted and reinserted. So at any particular point in time any of those values can be either inside the stack or outside, but not both. A thread can only push an index that it previously got via a pop. So the correct usage is for a thread to do:

auto index = stack.pop(); // get an index from the stack, if available
if(index.isValid()) {
    // do something with 'index'
    stack.push(index); // return index to the stack
}

Both the push and pop methods are implemented with an atomic load and a CAS loop.

What is the correct memory order semantic I should use in the atomic operations in pop and push (I wrote my guess commented out)?

struct LockFreeIndexStack
{
    typedef uint64_t bundle_t;
    typedef uint32_t index_t;

private:
    static const index_t s_null = ~index_t(0);

    typedef std::atomic<bundle_t> atomic_bundle_t;

    union Bundle {
        Bundle(index_t index, index_t count)
        {
            m_value.m_index = index;
            m_value.m_count = count;
        }

        Bundle(bundle_t bundle)
        {
            m_bundle = bundle;
        }

        struct {
            index_t m_index;
            index_t m_count;
        } m_value;

        bundle_t m_bundle;
    };

public:
    LockFreeIndexStack(index_t n)
        : m_top(Bundle(0, 0).m_bundle)
        , m_next(n, s_null)
    {
        for (index_t i = 1; i < n; ++i)
            m_next[i - 1] = i;
    }

    index_t pop()
    {
        Bundle curtop(m_top.load());  // memory_order_acquire?
        while(true) {
            index_t candidate = curtop.m_value.m_index;

            if (candidate != s_null) {  // stack is not empty?
                index_t next = m_next[candidate];
                Bundle newtop(next, curtop.m_value.m_count);
                // In the very remote eventuality that, between reading 'm_top' and
                // the CAS operation other threads cause all the below circumstances occur simultaneously:
                // - other threads execute exactly a multiple of 2^32 pop or push operations,
                //   so that 'm_count' assumes again the original value;
                // - the value read as 'candidate' 2^32 transactions ago is again top of the stack;
                // - the value 'm_next[candidate]' is no longer what it was 2^32 transactions ago
                // then the stack will get corrupted
                if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle)) {
                    return candidate;
                }
            }
            else {
                // stack was empty, no point in spinning
                return s_null;
            }
        }
    }

    void push(index_t index)
    {
        Bundle curtop(m_top.load());    // memory_order_relaxed?
        while (true) {
            index_t current = curtop.m_value.m_index;
            m_next[index] = current;
            Bundle newtop = Bundle(index, curtop.m_value.m_count + 1);
            if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle)) {
                return;
            }
        };
    }

private:
    atomic_bundle_t m_top;
    std::vector<index_t> m_next;
};
Fabio
  • 2,105
  • 16
  • 26
  • `Bundle(0, 0).m_bundle` exhibits undefined behavior, by way of accessing an inactive member of the union. Similarly with `Bundle curtop(m_top.load()); curtop.m_value.m_index` – Igor Tandetnik Nov 07 '18 at 14:34
  • `if (candidate != s_null)` looks wrong. The constructor initializes `m_top` to zero, not to `~0`. – Igor Tandetnik Nov 07 '18 at 14:39
  • 1
    There's a race between reading `m_next[candidate]` in `pop`, and assigning `m_next[index] = current` in `push`. Nothing prevents `index` to be the same as `candidate` – Igor Tandetnik Nov 07 '18 at 14:48
  • @Igor, regarding union, your point is correct as per standard. I could perhaps use `reinterpret_cast`, but that works the way I expect on most compilers (at least on clang, gcc and MSVC). Regarding `m_next`, it is initialized to `{1,2,...,s_null}`, so `top=0` is correct, it points to element `0`, which in turn points to element `1`, etc. Regarding the race, I need to look at it better. Consider however this code is tested (see the mentioned bitbucket repo). – Fabio Nov 07 '18 at 15:06
  • 1
    Regarding the race, I added some clarification on the semantic of this particular stack object. In `pop`, `candidate` is contained in the stack, in `push`, `index` is not contained in the stack, hence they cannot be the same. – Fabio Nov 07 '18 at 15:19
  • Nothing appears to prevent two threads from calling `pop()` and `push(0)` simultaneously on a freshly initialized `LockFreeIndexStack`. This would result in a race on `m_next[0]`. You seem to be making some assumptions about the behavior of the callers, but never state those assumptions. – Igor Tandetnik Nov 07 '18 at 15:29
  • @Igor, Regarding `union`, I found that type punning is explicitly allowed in the `gcc` online documentation. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Type%2Dpunning – Fabio Nov 07 '18 at 15:48
  • 1
    @Igor, a thread can only `push` an index that previously got trough a `pop`. It cannot arbitrarily call `push(0)`, if it did not previously got index `0’ via a `pop`. I will clarify in the question. – Fabio Nov 07 '18 at 15:51
  • 1
    must be `m_top.load(memory_order_relaxed)` in both `push` and `pop`. here we need only atomic read data. when in general must be `m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle, memory_order_acquire)` in `pop` and `m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle, memory_order_release)` in `push`. sense - if you write some data based on `index` and then `push` it (with release semantic) - all this writes will be visible after you `pop` with `acquire`. and no race in your code, it correct. however enough `m_count + 1` only in `pop` (or `push`) – RbMm Nov 07 '18 at 16:44
  • 1
    and if stack is empty think you need simply `return s_null` instead spin loop – RbMm Nov 07 '18 at 16:45
  • @RbMm, good suggestion to increment the counter only in `pop` or in `push`. I also took your advice to return `s_null` if the stack is empty, delegating to the caller a decision on how to handle that. The question is updated. Regarding memory order, could you please elaborate your explanation? If I do a `relaxed` read in `pop`, isn't there a risk that I keep getting `s_null`, so that the `CAS` is never attempted? Since `CAS` is a read/write operation, doesn't the semantic need to be at least `acq_rel`? Thanks – Fabio Nov 08 '18 at 03:13
  • Good point about UB. It's unfortunate that ISO C++ makes it so hard to do an efficient atomic load of *part* of a larger lock-free atomic object. It is safe to do so on most hardware, but expressing it in C++ is hard. [C++20 `atomic_ref`](https://en.cppreference.com/w/cpp/atomic/atomic_ref) may help. However, GNU C++ does make union type-punning as an extension, so with enough `static_assert` you can make sort-of-safe code that loads one of a pair of integers efficiently. Related: [How can I implement ABA counter with c++11 CAS?](https://stackoverflow.com/a/38991835) – Peter Cordes Nov 08 '18 at 05:38
  • @Peter, the use of a `union` is not strictly necessary. I could use just `struct alignas(8) Bundle { uint32_t m_index, m_count; }` and then define `typedef atomic bundle_t`. I just tested that, it is still lock-free and works fine. Or I could still use the union but never access `m_bundle`, leave it in the union just to achieve alignment on 8-byte boundaries. But I am not sure if any of these would be better, as all solutions seem to work fine. Any thought on the memory order? – Fabio Nov 08 '18 at 07:51
  • for read `m_top` you not need any memory order synchronization. strictly speaking you not need even atomic read here: even if you separate read `m_index` and `m_count` and between this 2 reads one of this values will be changed - even in this case will be no error, you simply go to the second loop. you only need atomic `m_index` read, which is auto on all real platforms (read int is atomic). so `Bundle curtop(m_top.load(memory_order_relaxed))` (in both pop and push) is ok and enough. real check point is `m_top.compare_exchange_weak` which not allow illegal values of... – RbMm Nov 08 '18 at 11:13
  • ... `curtop` (if it `!= m_top`). which here need memory order depend from *do something with 'index'* and *index.isValid()* if say you simply allocate `index` via pop - say allocate memory block from array at `index` and you not wait any context of block -, then when you want free block - simply do `push(index)` - the `memory_order_relaxed` in `compare_exchange_weak` is enough. from other side - if you write some data to block at `index` then `push(index)` it, and after `index = pop()` use content written by previous `push(index)` you need release at push and acquire on pop. – RbMm Nov 08 '18 at 11:13
  • possible implementation - https://pastebin.com/q2vJeGJY – RbMm Nov 08 '18 at 11:14
  • @RbMm, if I understand, suppose we have a shared array `int shared[stack_size] = {};`, then we have `auto index = stack.pop(); if (index.isValid()) { shared[index] += 2; stack.push(index); }`, then I need `release` at `push` and `acquire` at `pop`? – Fabio Nov 09 '18 at 02:08
  • `shared[index] += 2` assume that only single thread access `shared[index]` – RbMm Nov 09 '18 at 09:14
  • @Fabio: btw, just an idea: a better name for this structure would perhaps be *"pool"* rather than "stack". The requirement that you must push only what you previously popped is less obvious compared to the usual pool usage. I.e. with something like `if (pool.tryborrow(&idx)) { do_stuff(&idx); pool.return(&idx); }` it's more obvious that 1) borrow might fail, 2) you are acquiring a unique value, and 3) you are returning something you borrowed. – vgru Nov 23 '18 at 09:03
  • @Groo, yes, I came to your same conclusion. Why the reference? – Fabio Nov 24 '18 at 08:46
  • @RbMm, why does it imply single thread access? ` auto index=stack.pop(); shared[index]+=2; stack.push(index);` is the main loop run by multiple threads. So thread 1 can push an index which then is pop by thread 2. – Fabio Nov 24 '18 at 08:49
  • @Fabio Oh I didn't realize it was `index_t`, I thought it was just an `int` so I used a single function to borrow and check if valid. I think RbMm meant that borrowed indices must be unique, which is true in this case. – vgru Nov 24 '18 at 09:23

0 Answers0