10

I've written a container for a very simple piece of data that needs to be synchronized across threads. I want top performance, and I don't want to use locks.

I want to use "relaxed" atomics. Partly for that little bit of extra oomph, and partly to really understand them.

I've been working on this a lot, and I'm at the point where this code passes all tests I throw at it. That's not quite "proof" though, and so I'm wondering if there's anything I'm missing, or any other ways I can test this?

Here's my premise:

  • It is only important that a Node be properly pushed and popped, and that the Stack can never be invalidated.
  • I believe that the order of operations in memory is only important in one place:
    • Between the compare_exchange operations themselves. This is guaranteed, even with relaxed atomics.
  • The "ABA" problem is solved by adding identification numbers to the pointers. On 32-bit systems this requires a double-word compare_exchange; on 64-bit systems the unused top 16 bits of the pointer hold the id numbers.
  • Therefore: the stack will always be in a valid state. (right?)

Here's what I'm thinking. "Normally", the way we reason about code we're reading is to follow the order in which it's written. Memory can be read or written "out of order", but not in a way that invalidates the correctness of the program.

That changes in a multi-threaded environment. That's what memory fences are for - so that we can still look at the code and be able to reason about how it's going to work.

So if everything can go all out-of-order here, what am I doing with relaxed atomics? Isn't that a bit too far?

I don't think so, but that's why I'm here asking for help.

The compare_exchange operations themselves give a guarantee of sequential consistency with each other - all modifications of a single atomic object form one total modification order, even with relaxed ordering.

The only other time an atomic is read or written is the load of the head's value before a compare_exchange, done as part of initializing a local variable. As far as I can tell, it's irrelevant whether or not this load brings back a "proper" (up-to-date) value: a stale value just makes the first compare_exchange fail and hand back the current one.

Current code:

// Assumed elsewhere: stack_tag_t is a 16-bit unsigned integer type, and
// PROCESSOR_BITS is defined as 32 or 64. The 64-bit path stores the tag in
// the pointer's unused top 16 bits (little-endian layout assumed).
struct node
{
    node *n_;
#if PROCESSOR_BITS == 64
    inline constexpr node() : n_{ nullptr }                 { }
    inline constexpr node(node* n) : n_{ n }                { }
    inline void tag(const stack_tag_t t)                    { reinterpret_cast<stack_tag_t*>(this)[3] = t; }
    inline stack_tag_t read_tag()                           { return reinterpret_cast<stack_tag_t*>(this)[3]; }
    inline void clear_pointer()                             { tag(0); }
#elif PROCESSOR_BITS == 32
    stack_tag_t t_;
    inline constexpr node() : n_{ nullptr }, t_{ 0 }        { }
    inline constexpr node(node* n) : n_{ n }, t_{ 0 }       { }
    inline void tag(const stack_tag_t t)                    { t_ = t; }
    inline stack_tag_t read_tag()                           { return t_; }
    inline void clear_pointer()                             { }
#endif
    inline void set(node* n, const stack_tag_t t)           { n_ = n; tag(t); }
};

using std::memory_order_relaxed;
class stack
{
public:
    constexpr stack() : head_{} { }
    void push(node* n)
    {
        node next{n}, head{head_.load(memory_order_relaxed)};
        do
        {
            n->n_ = head.n_;
            next.tag(head.read_tag() + 1);
        } while (!head_.compare_exchange_weak(head, next, memory_order_relaxed, memory_order_relaxed));
    }

    bool pop(node*& n)
    {
        node clean, next, head{head_.load(memory_order_relaxed)};
        do
        {
            clean.set(head.n_, 0);

            if (!clean.n_)
                return false;

            next.set(clean.n_->n_, head.read_tag() + 1);
        } while (!head_.compare_exchange_weak(head, next, memory_order_relaxed, memory_order_relaxed));

        n = clean.n_;
        return true;
    }
protected:
    std::atomic<node> head_;
};

What's different about this question compared to others? Relaxed atomics. They make a big difference to the question.

So, what do you think? Is there anything I'm missing?

Michael Gazonda
  • What is a "relaxed atomic?" – John Dibling Jul 18 '14 at 18:08
  • The premises are to be applied to operations with relaxed memory order? What are the (thread-safety) requirements on object of type `Node*` managed with the stack and on the stack itself? – dyp Jul 18 '14 at 18:12
  • (I don't think you get the guarantees you'd need from an operation with relaxed memory order, since its side-effect is not required to be propagated immediately to other threads. So another thread might see an old value, which causes problems to both `push` and `pop` - those need provide the same node to all threads.) – dyp Jul 18 '14 at 18:13
  • As far as thread safety goes, an object on the stack is *only* the pointer to the next item. There should be no thread safety issues. – Michael Gazonda Jul 18 '14 at 18:16
  • What I mean with thread-safety requirements is: May multiple threads use the same `Stack` object without synchronization? May they use both `push` and `pop`? May they use a single `Node` object (or `Node*`) without synchronization? etc. – dyp Jul 18 '14 at 18:18
  • I think I'm getting side-tracked here. I'm still not convinced I need to change anything. *The instructions are atomic.* I don't think I care about the order. If the wrong order can invalidate the Stack, that would mean I need to change something. – Michael Gazonda Jul 18 '14 at 18:23
  • If two threads may access `compareAndSwap` of the *same* `Node` object without synchronization, then both can read the "old" value, and both set them to a (different) new value. This will create an orphaned `Node`. – dyp Jul 18 '14 at 18:28
  • @dyp - I don't see it, but if you could put this into an answer it would be appreciated. I'll look into how relaxed ordering of compare_and_exchange works more thoroughly. – Michael Gazonda Jul 18 '14 at 18:32
  • How do you prove your algorithm correct? Start with a formal definition of each atomic operation. Write a proof parallel to each method, starting with preconditions and ending with conclusions. Generally I find it best to embed said proof in the code as a sort of explicit curry-howard. If your proof is sufficiently formal, the remaining part is to ensure that the (informal) mapping between your proof of correctness and *actually* what you mean by "correct" holds. All of this is pretty hard to do. – Yakk - Adam Nevraumont Jul 18 '14 at 18:39
  • @Yakk Thank you. I guess I'm looking to understand all of the things that can go "wrong" before attempting a formal proof. That way I'd be pretty sure it's correct before spending the time proving it. – Michael Gazonda Jul 18 '14 at 18:42

3 Answers

4

push is broken, since you do not update node->_next after a compareAndSwap failure. It's possible that the node you originally stored with node->setNext has been popped from the top of stack by another thread when the next compareAndSwap attempt succeeds. As a result, some thread thinks it has popped a node from the stack but this thread has put it back in the stack. It should be:

void push(Node* node) noexcept
{
    Node* n = _head.next();
    do {
        node->setNext(n);
    } while (!_head.compareAndSwap(n, node));
}

Also, since next and setNext use memory_order_relaxed, there's no guarantee that _head_.next() here is returning the node most recently pushed. It's possible to leak nodes from the top of the stack. The same problem obviously exists in pop as well: _head.next() may return a node that was previously at the top of the stack but no longer is. If the returned value is nullptr, you may fail to pop when the stack is not actually empty.

pop can also have undefined behavior if two threads try to pop the last node from the stack at the same time. They both see the same value for _head.next(), and one thread successfully completes pop. The other thread enters the while loop - since the observed node pointer is not nullptr - but the compareAndSwap loop soon updates it to nullptr since the stack is now empty. On the next iteration of the loop, that nullptr is dereferenced to get its _next pointer and much hilarity ensues.

pop is also clearly suffering from ABA. Two threads can see the same node at the top of the stack. Say one thread gets to the point of evaluating the _next pointer and then blocks. The other thread successfully pops the node, pushes 5 new nodes, and then pushes that original node again all before the other thread wakes. That other thread's compareAndSwap will succeed - the top-of-stack node is the same - but store the old _next value into _head instead of the new one. The five nodes pushed by the other thread are all leaked. This would be the case with memory_order_seq_cst as well.

Casey
  • Wouldn't the compare_and_exchange guarantee that we're writing the proper node? If we pull in bad data with next() and memory_order_relaxed, then the compare_and_exchange will fail and give us the new value for us to try again with. – Michael Gazonda Jul 18 '14 at 21:33
  • @MGaz It would, except that you aren't storing the new value in the node - you dump it into a temporary `Node*` that you ignore (`n`). – Casey Jul 18 '14 at 21:37
  • Sigh. Last minute changes. I should've caught that one. Ty for the catch. I'll get this going soon... – Michael Gazonda Jul 18 '14 at 21:38
  • code has been updated that works across all tests I can throw at it. ty again. – Michael Gazonda Jul 24 '14 at 20:24
2

Leaving to one side the difficulty of implementing the pop operation, I think memory_order_relaxed is inadequate. Before pushing the node, one assumes that some value(s) will be written into it, which will be read when the node is popped. You need some synchronization mechanism to ensure that those values have actually been written before they are read. memory_order_relaxed does not provide that synchronization... memory_order_acquire/memory_order_release would.

  • It seems that a successful compare_exchange operation will give the required synchronization. See Casey's comments on my answer below. – Michael Gazonda Jul 20 '14 at 13:56
  • The point of `memory_order_relaxed` is that you have **no idea** what order it happens in wrt **other** memory operations on **other** memory locations. –  Jul 20 '14 at 14:01
  • Yes, but compare_exchange operations have their own guarantees. – Michael Gazonda Jul 20 '14 at 14:07
  • I understand that all read-modify-write operations on location **x** have a global ordering. I have looked, but have failed to find anything which tells me that a `memory_order_relaxed` read-modify-write has any special synchronization properties wrt other locations... I confess that every time I read the C11 standard on the topic I am ready to strangle the authors :-(... so if I am missing something here, I shall be happy to learn. –  Jul 20 '14 at 15:01
  • Afaik, I don't need synch with _other_ locations though, just the locations directly used in the compare_exchange. I might be missing something too. This is all very complex. – Michael Gazonda Jul 20 '14 at 15:03
  • The head_ node is where the story of memory continuity is being told. Every successful operation there becomes a part of that story, and even relaxed memory ordering can't mess with that. There are no other locations I'm using that require a complete synch. There's still something bugging me about the push operation though, and I'll be looking into that one today. – Michael Gazonda Jul 20 '14 at 15:08
  • Well... if you don't synchronize for _other_ locations, (and assuming that the node does carry some data), how do you know that the write(s) of the data (_sequenced before_) the push, are visible to (_happen before_) the read(s) of that data after the pop ? –  Jul 20 '14 at 15:22
  • As far as the Stack is concerned, the only data the Node carries is the pointer to next_, which is a part of the sync. I'm going to think about your questions some more though, as I know there's something I'm missing. – Michael Gazonda Jul 20 '14 at 15:26
2

This code is completely broken.

The only reason this appears to work is that current compilers aren't very aggressive with reordering across atomic operations and x86 processors have pretty strong guarantees.

The first problem is that, without synchronization, there is no guarantee that the client of this data structure will even see the node object's fields as initialized. The next issue is that, without synchronization, the push operation can read arbitrarily old values for the head's tag.

We have developed a tool, CDSChecker, that simulates most behaviors that the memory model allows. It is open source and free. Run it on your data structure to see some interesting executions.

Proving anything about code that utilizes relaxed atomics is a big challenge at this point. Most proof methods break down because they are typically inductive in nature, and you don't have an order to induct on. So you get out-of-thin-air read issues...

briand
  • "_Proving anything about code that utilizes relaxed atomics is a big challenge_" No. It's impossible, period, because a relaxed load could see an invalid value or even invalid representation coming "from the future" and causes UB. You can't have UB in a language with such semantics. Apparently the C++ committee copy-pasted piece of Java semantics w/o giving it any thoughts. **The specification is meaningless.** Of course real impl never do that and only do necessary things but the spec writers clearly had no idea how to capture that. – curiousguy May 06 '19 at 18:29
  • "_because they are typically inductive in nature, and you don't have an order to induct on_" That also is blatantly false **because "latest" is used**. The C++ std is wrong. – curiousguy May 06 '19 at 18:32
  • @curiousguy You can definitely prove some limited code patterns that use relaxed correct. Consider the test and test and set pattern used to avoid contention in certain spin locks. The initial test load can be a relaxed and the correctness proof shouldn't be too difficult. – briand May 08 '19 at 05:36