
I have the following C++11 code:

#include <atomic>
#include <thread>
#include <cassert>

std::atomic<bool> x, y;
std::atomic<int> z;

void f() {
   x.store(true, std::memory_order_relaxed);
   std::atomic_thread_fence(std::memory_order_release);
   y.store(true, std::memory_order_relaxed);
}

void g() {
   while (!y.load(std::memory_order_relaxed)) {}
   std::atomic_thread_fence(std::memory_order_acquire);
   if (x.load(std::memory_order_relaxed)) ++z;
}

int main() {
   x = false;
   y = false;
   z = 0;
   std::thread t1(f);
   std::thread t2(g);
   t1.join();
   t2.join();
   assert(z.load() != 0);
   return 0;
}

In my computer architecture class, we've been told that the assert in this code always holds. But after reviewing it thoroughly now, I can't really understand why that is so.

For what I know:

  • A fence with 'memory_order_release' will not allow stores that precede it to be reordered after it
  • A fence with 'memory_order_acquire' will not allow loads that come after it to be reordered before it

If my understanding is correct, why can't the following sequence of actions occur?

  1. Inside t1, y.store(true, std::memory_order_relaxed); is called
  2. t2 runs entirely, and will see a 'false' when loading 'x', therefore not incrementing z
  3. t1 finishes execution
  4. In the main thread, the assert fails because z.load() returns 0

I think this complies with the 'acquire'-'release' rules, but, for example, the top answer to this question: Understanding c++11 memory fences, which is very similar to my case, hints that something like step 1 in my sequence of actions cannot happen before the 'memory_order_release' fence, but doesn't go into detail about the reason behind it.

I'm terribly puzzled about this, and will be very glad if anyone could shed some light on it :)

alfongj

2 Answers


Exactly what happens in each of these cases depends on what processor you are actually using. For example, x86 would probably not fail the assert here, since it is a cache-coherent architecture (you can still have race conditions, but once a value is written out to cache/memory from one processor, all other processors will read that value; of course, that doesn't stop another processor from writing a different value immediately after, etc.).

So assuming this is running on an ARM or similar processor that isn't guaranteed to be cache-coherent by itself:

Because the write to x is done before the memory_order_release fence, the t2 loop will not exit the while(y...) until x is also true. This means that when x is read later on, it is guaranteed to be true, so z is updated. My only slight query is whether you don't need a release for z as well... If main is running on a different processor than t1 and t2, then z may still have a stale value in main.

Of course, that's not GUARANTEED to happen if you have a multitasking OS (or just interrupts that do enough stuff, etc) - since if the processor that ran t1 gets its cache flushed, then t2 may well read the new value of x.

And like I said, this won't have that effect on x86 processors (AMD or Intel ones).

So, to explain barrier instructions in general (also applicable to Intel and AMD processors):

First, we need to understand that although instructions can start and finish out of order, the processor does have a general "understanding" of order. Let's say we have this "pseudo-machine-code":

 ...
 mov $5, x
 cmp a, b
 jnz L1
 mov $4, x

L1: ...

The processor could speculatively execute mov $4, x before it completes the "jnz L1" - so, to solve this, the processor would have to roll back the mov $4, x in the case where the jnz L1 was taken.

Likewise, if we have:

 mov $1, x
 wmb         // "write memory barrier"
 mov $1, y

the processor has rules that say "do not execute any store instruction issued AFTER wmb until all stores before it have been completed". It is a "special" instruction - it's there for the precise purpose of guaranteeing memory ordering. If it's not doing that, you have a broken processor, and someone in the design department has "his ass on the line".

Equally, the "read memory barrier" is an instruction which guarantees, by the designers of the processor, that the processor will not complete another read until we have completed the pending reads before the barrier instruction.

As long as we're not working on "experimental" processors or some skanky chip that doesn't work correctly, it WILL work that way. It's part of the definition of that instruction. Without such guarantees, it would be impossible (or at least extremely complicated and "expensive") to implement (safe) spinlocks, semaphores, mutexes, etc.

There are often also "implicit memory barriers" - that is, instructions that act as memory barriers even though that is not their primary purpose. Software interrupts ("INT X" instruction or similar) tend to do this.

Mats Petersson
  • +1 for interesting information. But question: Are you saying that on a non-cache-coherent architecture, even the `memory_order_release`-memory fence does not suffice to ensure the caches of other processors are updated? – jogojapan Jan 24 '13 at 01:35
  • No, I'm saying that using `memory_order_release` will ensure cache-coherency. But I have to apologize, I read your code wrong - I don't see how this would fail the assertion. If y is true in t2, then x will be true in t2. So, assuming the `t2` loop eventually finishes, then `z` should be incremented. Sorry about that. – Mats Petersson Jan 24 '13 at 01:39
  • I will update the answer. But bear in mind that the two "join" ensures that both threads are finished before the main gets to the assert. – Mats Petersson Jan 24 '13 at 01:41
  • Oh and, although it's not my question, just to clarify what I think the point of the question is: Can `y.store()` be executed before `x.store()` in `f()`, or does the Standard guarantee it won't? – jogojapan Jan 24 '13 at 01:45
  • Ah, sorry, misread who had commented vs. original question answerer. – Mats Petersson Jan 24 '13 at 01:46
  • The `memory_order_release` guarantees [at least should guarantee] that nothing written AFTER that is executed BEFORE that instruction. How it really works is that the processor has various buffers and stuff to store writes into before they actually go out into memory. When it hits a "write barrier", it will make sure all writes pending at that point are actually written to memory (and invalidated if the processor isn't cache-coherent). Same with a read barrier for pending reads - it ensures all pending reads are completed before the barrier instruction "lets the next instruction continue". – Mats Petersson Jan 24 '13 at 01:49
  • Hi Mats. Thanks for your answer, but I still don't see it. As you say, a "write barrier" will make sure that all writes pending are written to memory at that point, but that doesn't imply that a write that comes AFTER the barrier cannot be executed BEFORE the barrier (e.g., with instruction reordering by the compiler or the processor). So the y.store() could be executed before the x.store() in f() – alfongj Jan 24 '13 at 09:23

I don't like arguing about C++ concurrency questions in terms of "this processor does this, that processor does that". C++11 has a memory model, and we should be using this memory model to determine what is valid and what isn't. CPU architectures and memory models are usually even harder to understand. Plus there's more than one of them.

With this in mind, consider this: thread t2 is blocked in the while loop until t1 executes the y.store and the change has propagated to t2. (Which, in theory, could take forever, but that's not realistic.) So the y.load that lets t2 leave the loop reads the value written by the y.store in t1.

Furthermore, we have simple intra-thread sequenced-before relations: in t1, the x.store is sequenced before the release fence, which is sequenced before the y.store.

In t2, the true-returning y.load is sequenced before the acquire fence, which is sequenced before the x.load.

This is exactly the situation the standard's fence rules cover: an atomic store sequenced after a release fence is read by an atomic load sequenced before an acquire fence, so the release fence synchronizes-with the acquire fence. Happens-before is transitive, so the x.store happens-before the x.load, which means the load has to see the value stored.

Finally, the z increment (++z is a fetch-and-add, i.e. a pre-increment on an atomic) happens-before the thread termination, which happens-before the main thread waking from t2.join, which happens-before the z.load in the main thread, so the modification to z must be visible in the main thread.

Sebastian Redl
  • Yes, so the bottom line is that what I wasn't understanding is that the `std::atomic_thread_fence(std::memory_order_release);` creates a happens-before relationship between itself plus previous stores, and the `y.store(true, std::memory_order_relaxed);`. Unfortunately Mats already explained it so I give him the checkmark for quickness :) – alfongj Jan 24 '13 at 12:57