11

Boost provides a sample atomically reference-counted shared pointer.

Here is the relevant code snippet and the explanation of the various memory orderings used:

#include <boost/atomic.hpp>
#include <boost/intrusive_ptr.hpp>

class X {
public:
  typedef boost::intrusive_ptr<X> pointer;
  X() : refcount_(0) {}

private:
  mutable boost::atomic<int> refcount_;

  // Increment is relaxed: a new reference can only be formed from an
  // existing one, and handing that reference to another thread already
  // provides the required synchronization.
  friend void intrusive_ptr_add_ref(const X * x)
  {
    x->refcount_.fetch_add(1, boost::memory_order_relaxed);
  }

  // Decrement is a release, so all prior accesses through this reference
  // happen-before it; the acquire fence runs only in the thread that
  // drops the last reference, ordering those accesses before the delete.
  friend void intrusive_ptr_release(const X * x)
  {
    if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) {
      boost::atomic_thread_fence(boost::memory_order_acquire);
      delete x;
    }
  }
};

Increasing the reference counter can always be done with memory_order_relaxed: New references to an object can only be formed from an existing reference, and passing an existing reference from one thread to another must already provide any required synchronization.
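To make that hand-off concrete, here is a minimal sketch (my own illustration, not from the Boost documentation; it assumes the X class above and C++11 <atomic>). The release store / acquire load on the slot is the synchronization that lets the increments stay relaxed:

#include <atomic>

std::atomic<X*> slot{nullptr};

// producer() and consumer() are meant to run in two different threads.
void producer()
{
    X::pointer p(new X);                    // refcount: 0 -> 1 (relaxed)
    intrusive_ptr_add_ref(p.get());         // extra reference for the consumer
    slot.store(p.get(), std::memory_order_release);  // the hand-off itself
}   // producer's reference dropped here

void consumer()
{
    X* raw;
    while ((raw = slot.load(std::memory_order_acquire)) == nullptr) { }
    X::pointer p(raw, /*add_ref=*/false);   // adopt the published reference
    // ... use *p ...
}   // dropping the last reference deletes the object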

It is important to enforce that any possible access to the object in one thread (through an existing reference) happens before deleting the object in a different thread. This is achieved by a "release" operation after dropping a reference (any access to the object through this reference must obviously have happened before), and an "acquire" operation before deleting the object.

It would be possible to use memory_order_acq_rel for the fetch_sub operation, but this results in an unneeded "acquire" operation whenever the reference counter has not yet reached zero and may impose a performance penalty.
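For comparison, a sketch of what intrusive_ptr_release would look like with that ordering (my transcription, not Boost's code); the separate fence disappears because the read-modify-write itself performs the acquire:

friend void intrusive_ptr_release(const X * x)
{
  // acq_rel: correct, but every thread now pays for the acquire,
  // even when it does not drop the last reference.
  if (x->refcount_.fetch_sub(1, boost::memory_order_acq_rel) == 1) {
    delete x;
  }
}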

I am not able to understand why the memory_order_acquire barrier is necessary before the delete x operation. Specifically, how would it be safe for the compiler/processor to reorder the memory operations of delete x to before the fetch_sub and the test of the returned count against 1, without violating the single-threaded semantics?

EDIT: I guess my question wasn't very clear. Here is a rephrased version:

Will the control dependency between the read of the refcount (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) and the delete x operation provide any ordering guarantee at all? Even considering a single-threaded program, is it possible for the compiler/processor to reorder the instructions corresponding to the delete x operation to before the fetch_sub and the comparison? It would be really helpful if the answer were as low-level as possible and included an example scenario where the delete operation gets reordered (without affecting the single-threaded semantics), thus illustrating the need to preserve ordering.

  • 3
    It's probably best to think of `acquire` semantics more abstractly. It's *not* (only) about whether the operations are reordered, so your question about reordering isn't really relevant. It's about whether this thread "pulls" into its view of memory, any changes that other threads might have made to memory locations other than the refcount, before its own operations involving those memory locations. Without the `acquire` it does not do so, that is to say there are potential data races. Memory barriers are not (only) restrictions on instruction reordering, they're `git push` and `git pull`. – Steve Jessop Jan 03 '15 at 04:58
  • 1
    @SteveJessop - The abstract view does not help me much. TBH I find them even more confusing (and magical) than the hardware centric view. For example, AFAIU acquire and release barriers do not care about the memory accesses in other threads, but rather simply maintain the load->[load/store] and store->[load/store] orderings respectively, for memory accesses in the same thread. Thinking of them as `git push` and `git pull` suggests that they actually know about the other threads in the picture, when they clearly don't. – Scott Peterson Jan 03 '15 at 08:21
  • 1
    Two other SO questions on the same subject: http://stackoverflow.com/questions/10268737/c11-atomics-and-intrusive-shared-pointer-reference-count, http://stackoverflow.com/questions/25304201/why-there-need-memory-order-limit-on-reference-counter. – Alexey Kukanov Jan 03 '15 at 14:42
  • I agree with @SteveJessop that the abstract semantics are the best way to reason about these things, because that is what the standard mandates. For a concrete example, think of "release" as "flush write cache" and "acquire" as "flush read cache". Imagine that the thread deleting x has cached part of the object modified by the other thread. Then, without both the "release" and the "acquire", the `delete` could blow up on non-POD objects. (For POD objects, it is harder to imagine a real-life failure. But "it would trigger undefined behavior" is actually a sufficient answer...) – Nemo Sep 03 '15 at 01:52
  • @Nemo so your description of rel/acq is that rel acts on previous pending writes and acq acts on previous memoized reads? – curiousguy May 08 '19 at 05:11

3 Answers

6

Consider two threads, each holding one reference to the object, which are the last two references:

------------------------------------------------------------
        Thread 1                              Thread 2
------------------------------------------------------------
   // play with x here

   fetch_sub(...)                            
                                            fetch_sub(...)
   // nothing
                                            delete x;

You have to ensure that any changes made to the object by Thread 1 in // play with x here are visible to Thread 2 when it calls delete x;. For this you need an acquire fence, which, together with the memory_order_release on the fetch_sub() calls, guarantees that the changes made by Thread 1 will be visible.
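As a self-contained illustration (my own sketch; the payload field and the std::thread usage are additions for demonstration, not part of the question's class), this program exercises exactly the last-two-references scenario above:

#include <atomic>
#include <thread>

struct Obj {
    int payload = 0;                  // stands in for "play with x here"
    std::atomic<int> refcount{2};     // the two last references
};

void drop_ref(Obj* o)
{
    if (o->refcount.fetch_sub(1, std::memory_order_release) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete o;                     // sees all writes made before both decrements
    }
}

int main()
{
    Obj* o = new Obj;
    std::thread t1([o] {
        o->payload = 42;              // Thread 1: play with x here
        drop_ref(o);                  // release publishes the write above
    });
    std::thread t2([o] { drop_ref(o); });   // whichever thread hits 0 deletes
    t1.join();
    t2.join();
}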

T.C.
  • 2
    Will the acquire barrier still be needed if it is guaranteed that x is a pointer to a POD that has no destructor (we can just do free(x))? Also, can the free(x) be reordered before the `if (x.fetch_sub(...) == 1)` without violating the semantics of the single-threaded execution? – Scott Peterson Jan 03 '15 at 04:02
  • 1
    @ScottPeterson: I'm too lazy to wade through the standard for the legalese, but if you have a `free` of a memory block which is unsequenced with respect to access of the same memory in a different thread, that's a data race and undefined behavior. So "It is important to enforce any possible access to the object in one thread (through an existing reference) to happen before deleting the object in a different thread" still applies even if all delete does is `free`. – Steve Jessop Jan 03 '15 at 04:55
  • 2
    @scott imagine the data in a processor cache. It gets modified, but not written. Then the data block is sent to heap for recycle. Then the cache is flushed. Boom, you just wrote to memory you do not own. – Yakk - Adam Nevraumont Jan 03 '15 at 05:50
  • @Yakk Your explanation is closer to what I was looking for than the others (see my edit in the original question). However, when you say "data block is sent to heap for recycle", how can that happen, since the `delete x` only executes when `x == 1`, even in the single-threaded case, and `x == 1` implies that all uses of the data through x are complete (because of the release barrier)? – Scott Peterson Jan 03 '15 at 08:39
  • @scott While far from an expert on this, I thought release is usually defined relative to other acquires. I.e., things before the release in thread A happen before things after the acquire in thread B. On some architectures (such as x86 and x64) the only available barriers do more than this. – Yakk - Adam Nevraumont Jan 03 '15 at 13:36
  • @SteveJessop: Why do you say that the "free is unsequenced"? The free(X) happens inside the if block i.e only after it is established that "x == 1". Isn't that a control dependency and thus guarantees that the free is sequenced correctly? – Scott Peterson Jan 04 '15 at 12:14
  • @ScottPeterson In standardese terms, the release-acquire pairing is necessary to guarantee that anything in `// play with x here` [*inter-thread happens before*](http://en.cppreference.com/w/cpp/atomic/memory_order#Inter-thread_happens-before) `delete x;`. – T.C. Jan 04 '15 at 12:58
  • @T.C.: In the statement `x->refcount_.fetch_sub()`, is the `Release` necessary to order the read of `refcount` from x (i.e. `load [x + offset of refcount]`), or does the data dependency guarantee it? – Scott Peterson Jan 06 '15 at 00:15
  • @T.C.: @Yakk: Is the acquire still necessary even if `// play with x here` only reads from x, i.e. x points to immutable shared data? – Scott Peterson Jan 06 '15 at 01:39
  • @ScottPeterson `fetch_sub` is an atomic read-modify-write, so it's always guaranteed to read the last value written before its write. Without the acquire, there's no ordering between `//play with x here` and `delete x;`, so you have a data race. – T.C. Jan 06 '15 at 02:19
  • @T.C. What I don't understand is: since `delete x` executes only when the refcount is 0, and x is immutable, why should we worry about a data race? What exactly does the "Acquire" barrier do at the hardware level? If x were instead mutable, then it makes sense to have an acquire, since the "Acquire" could potentially bring in the queued updates made by other threads. But if x is immutable there is nothing to bring in... and since the Release barrier already makes sure all loads from x are complete before the fetch_sub, and delete x only executes when the refcount is 0, what function does the Acquire have? – Scott Peterson Jan 06 '15 at 02:32
  • @T.C. "_always guaranteed to read the last value_" What is the "last" value? Last compared to what? – curiousguy May 08 '19 at 05:19
2

From http://en.cppreference.com/w/cpp/atomic/memory_order:

memory_order_acquire -- A load operation with this memory order performs the acquire operation on the affected memory location: prior writes made to other memory locations by the thread that did the release become visible in this thread.

...

Release-Acquire ordering

If an atomic store in thread A is tagged std::memory_order_release and an atomic load in thread B from the same variable is tagged std::memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B, that is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.

The synchronization is established only between the threads releasing and acquiring the same atomic variable. Other threads can see different order of memory accesses than either or both of the synchronized threads.

On strongly-ordered systems (x86, SPARC TSO, IBM mainframe), release-acquire ordering is automatic for the majority of operations. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store-release or perform non-atomic loads earlier than the atomic load-acquire). On weakly-ordered systems (ARM, Itanium, PowerPC), special CPU load or memory fence instructions have to be used.

This means that a release publishes the current thread's pending writes so that other threads can synchronize with them, while a later acquire pulls in the changes made by the other threads.

On strongly-ordered systems, this is not as important. I don't think these operations even generate extra code, since the CPU automatically locks cache lines before any writes can occur; the cache is guaranteed to be consistent. But on weakly-ordered systems, while atomic operations are well defined, there could be pending operations to other parts of memory.

So, let's say threads A and B both share some data D.

  1. A gets some lock and does things to D
  2. A releases the lock
  3. B releases the lock, finds a ref count of 0, and so decides to delete D
  4. B deletes D
  5. ... the data pending from #1 is not visible yet, so bad things happen.

With the acquire thread fence before the delete, the current thread synchronizes with all pending operations from other threads in its address space. And when the delete happens, it sees what A did in #1.
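To see that mechanism in isolation, here is a hedged sketch (my own example, not from the answer) of the same pairing without any reference counting: a relaxed load followed by an acquire fence synchronizes with a release store, exactly as the fetch_sub/fence pair does above. thread_A and thread_B are assumed to run concurrently:

#include <atomic>

int data;                         // "D": plain, non-atomic shared data
std::atomic<bool> ready{false};

void thread_A()
{
    data = 42;                                        // #1: do things to D
    ready.store(true, std::memory_order_release);     // #2: publish
}

void thread_B()
{
    while (!ready.load(std::memory_order_relaxed)) { }   // wait for #2
    std::atomic_thread_fence(std::memory_order_acquire); // pull in A's writes
    // Without the fence, this read would race with #1; with it, data == 42.
    int d = data; (void)d;
}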

user3427419
0

I think I found a rather simple example that shows why the acquire fence is needed.

Let's assume our X looks like this:

#include <atomic>    // atomic
#include <cstdlib>   // free
using namespace std; // the snippets below use unqualified std names

struct X
{
    ~X() { free(data); }    // X owns `data`, assumed to be malloc'd elsewhere
    void* data;
    atomic<int> refcount;
};

Let's further assume that we have two functions foo and bar that look like this (I'll inline the reference count decrements):

void foo(X* x)
{
    void* newData = generateNewData();
    free(x->data);
    x->data = newData;
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
        delete x;
}

void bar(X* x)
{
    // Do something unrelated to x
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
        delete x;
}

The delete expression will execute x's destructor and then free the memory occupied by x. Let's inline that:

void bar(X* x)
{
    // Do something unrelated to x
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
    {
        free(x->data);
        operator delete(x);
    }
}

Because there is no acquire fence, the compiler could decide to load x->data into a register before executing the atomic decrement (assuming there is no data race, the observable single-threaded effect is the same):

void bar(X* x)
{
    register void* r1 = x->data;
    // Do something unrelated to x
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
    {
        free(r1);
        operator delete(x);
    }
}

Now let's assume that the refcount of x is 2 and that we have two threads. Thread 1 calls foo, thread 2 calls bar:

  1. Thread 2 loads x->data to a register.
  2. Thread 1 generates new data.
  3. Thread 1 frees the "old" data.
  4. Thread 1 assigns the new data to x->data.
  5. Thread 1 decrements refcount from 2 to 1.
  6. Thread 2 decrements refcount from 1 to 0.
  7. Thread 2 frees the "old" data again instead of the new data.

The key insight for me was that "prior writes [...] become visible in this thread" can mean something as trivial as "do not use values you cached in registers before the fence".
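For completeness, here is a sketch of one way to repair bar (my own variant, following the Boost recipe from the question): the acquire fence on the deleting path forbids the compiler from hoisting the load of x->data above the decrement, and makes Thread 1's assignment visible before the free.

void bar(X* x)
{
    // Do something unrelated to x
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
    {
        // The fence pairs with the other thread's release decrement and
        // orders the loads below after it, so x->data must be (re)read
        // here, after Thread 1's update is visible.
        atomic_thread_fence(memory_order_acquire);
        free(x->data);
        operator delete(x);
    }
}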