I've written a container for a very simple piece of data that needs to be synchronized across threads. I want the top performance. I don't want to use locks.
I want to use "relaxed" atomics. Partly for that little bit of extra oomph, and partly to really understand them.
I've been working on this a lot, and I'm at the point where this code passes all tests I throw at it. That's not quite "proof" though, and so I'm wondering if there's anything I'm missing, or any other ways I can test this?
Here's my premise:
- It is only important that a Node be properly pushed and popped, and that the Stack can never be invalidated.
- I believe that the order of operations in memory is only important in one place:
- Between the compare_exchange operations themselves. This is guaranteed, even with relaxed atomics.
- The "ABA" problem is solved by adding identification numbers to the pointers. On 32 bit systems, this requires a double-word compare_exchange, and on 64 bit systems the unused 16 bits of the pointer are filled with id numbers.
- Therefore: the stack will always be in a valid state. (right?)
Here's what I'm thinking. "Normally", the way we reason about code that we're reading is to look at the order in which it's written. Memory can be read or written to "out of order", but not in a way that invalidates the correctness of the program.
That changes in a multi-threaded environment. That's what memory fences are for - so that we can still look at the code and be able to reason about how it's going to work.
So if everything can go all out-of-order here, what am I doing with relaxed atomics? Isn't that a bit too far?
I don't think so, but that's why I'm here asking for help.
The compare_exchange operations themselves give a guarantee of sequential constancy with each other.
The only other time there is read or write to an atomic is to get the head's initial value before a compare_exchange. It is set as part of the initialization of a variable. As far as I can tell, it would be irrelevant whether or not this operation brings back a "proper" value.
Current code:
struct node
{
node *n_;
#if PROCESSOR_BITS == 64
inline constexpr node() : n_{ nullptr } { }
inline constexpr node(node* n) : n_{ n } { }
inline void tag(const stack_tag_t t) { reinterpret_cast<stack_tag_t*>(this)[3] = t; }
inline stack_tag_t read_tag() { return reinterpret_cast<stack_tag_t*>(this)[3]; }
inline void clear_pointer() { tag(0); }
#elif PROCESSOR_BITS == 32
stack_tag_t t_;
inline constexpr node() : n_{ nullptr }, t_{ 0 } { }
inline constexpr node(node* n) : n_{ n }, t_{ 0 } { }
inline void tag(const stack_tag_t t) { t_ = t; }
inline stack_tag_t read_tag() { return t_; }
inline void clear_pointer() { }
#endif
inline void set(node* n, const stack_tag_t t) { n_ = n; tag(t); }
};
using std::memory_order_relaxed;
class stack
{
public:
constexpr stack() : head_{}{}
void push(node* n)
{
node next{n}, head{head_.load(memory_order_relaxed)};
do
{
n->n_ = head.n_;
next.tag(head.read_tag() + 1);
} while (!head_.compare_exchange_weak(head, next, memory_order_relaxed, memory_order_relaxed));
}
bool pop(node*& n)
{
node clean, next, head{head_.load(memory_order_relaxed)};
do
{
clean.set(head.n_, 0);
if (!clean.n_)
return false;
next.set(clean.n_->n_, head.read_tag() + 1);
} while (!head_.compare_exchange_weak(head, next, memory_order_relaxed, memory_order_relaxed));
n = clean.n_;
return true;
}
protected:
std::atomic<node> head_;
};
What's different about this question compared to others? Relaxed atomics. They make a big difference to the question.
So, what do you think? Is there anything I'm missing?