
This question comes from a very nice paper and article about memory reordering.

Q1: Am I right that cache coherence, the store buffer, and the invalidation queue are the root causes of memory reordering?

Store-release is quite understandable: all prior loads and stores have to complete before the flag is set to true.

As for load-acquire, a typical use of an atomic load is waiting on a flag. Suppose we have 2 threads:

int x = 0;
std::atomic<bool> ready_flag{false};
// thread-1
if(ready_flag.load(std::memory_order_relaxed))
{
    // (1)
    // load x here
}
// (2)
// load x here
// thread-2
x = 100;
ready_flag.store(true, std::memory_order_release);

EDIT: in thread-1 it should be a while loop, but I copied the logic from the article above. So assume the memory reordering occurs just in time.

Q2: Because (1) and (2) depend on the if condition, the CPU has to wait for ready_flag. Does that mean write-release is enough? How can memory reordering happen in this context?

Q3: Obviously we have load-acquire, so I guess memory reordering is possible; then where should we place the fence, at (1) or (2)?

LongLT
  • C++ is not specified in terms of a specific CPU model. The question doesn't make sense in terms of the std. Maybe you meant in terms of separately compiled code, and the ABI? – curiousguy Nov 11 '19 at 17:40
  • If you really want to ask a Q about the CPU worded in terms of C++, **make everything volatile**. As a rule, when you want to go low, use volatile. – curiousguy Nov 11 '19 at 17:44
  • "_CPU have to wait for ready_flag_" Modern CPUs do as much execution in advance as possible. Actually all execution begins as speculative, to be confirmed later: almost any asm instruction could raise an exception that should stop execution (if exceptions are precise), so almost every instruction is effectively conditional. – curiousguy Nov 11 '19 at 17:52
  • @LongLT: "*CPU have to wait for ready_flag*" That's not what your code says. It only checks the flag *once*; it's not waiting for anything. It also reads `x` regardless of the state of the flag. And even that is ignoring the bad memory order. – Nicol Bolas Nov 11 '19 at 18:03
  • sorry to confuse you all, I just edited my question – LongLT Nov 12 '19 at 04:05

3 Answers


Accessing an atomic variable is not a mutex operation; it merely accesses the stored value atomically. No CPU operation can interrupt the access, so no data race can occur with regard to accessing that value (the access can also issue barriers with regard to other accesses, which is what the memory orders provide). But that's it; it doesn't wait for any particular value to appear in the atomic variable.

As such, your if statement will read whatever value happens to be there at the time. If you want to guard access to x until the other statement has written to it and signaled the atomic, you must:

  1. Not allow any code to read from x until the atomic flag has returned the value true. Simply testing the value once won't do that; you must loop over repeated accesses until it is true. Any other attempt to read from x results in a data race and is therefore undefined behavior.

  2. Whenever you access the flag, you must do so in a way that tells the system that values written by the thread setting that flag should be visible to subsequent operations that see the set value. That requires a proper memory order, one which must be at least memory_order_acquire.

    To be technical, the read from the flag itself doesn't have to do the acquire. You could perform an acquire operation after having read the proper value from the flag. But you need to have an acquire-equivalent operation happen before reading x.

  3. The writing statement must set the flag using a releasing memory order that must be at least as powerful as memory_order_release.
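
Putting those three requirements together, here is a minimal sketch of a corrected version of the question's example (the spin-wait and the function names are illustrative, not from the question):

    #include <atomic>

    int x = 0;
    std::atomic<bool> ready_flag{false};

    // thread-2: write the data, then publish it with a release store
    void producer() {
        x = 100;
        ready_flag.store(true, std::memory_order_release);
    }

    // thread-1: spin until the flag is set; the acquire load pairs with
    // the release store, so the write to x is guaranteed to be visible
    void consumer() {
        while (!ready_flag.load(std::memory_order_acquire)) {
            // keep spinning
        }
        int tmp = x;  // no data race: tmp is guaranteed to be 100
        (void)tmp;
    }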

Nicol Bolas

Because (1) and (2) depend on the if condition, the CPU has to wait for ready_flag

There are 2 showstopper flaws in that reasoning:

  1. Branch prediction + speculative execution is a real thing in real CPUs. Control dependencies behave differently from data dependencies. Speculative execution breaks control dependencies.

    In most (but not all) real CPUs, data dependencies do work like C++ memory_order_consume. A typical use-case is loading a pointer and then dereferencing it. That's still not safe in C++'s very weak memory model, but will happen to compile to asm that works on most ISAs other than DEC Alpha. Alpha can (in practice on some hardware) even manage to violate causality and load a stale value when dereferencing a just-loaded pointer, even if the stores were correctly ordered. (See the pointer-publication sketch after this list.)

  2. Compilers can break control and even data dependencies. C++ source logic doesn't always translate directly to asm. In this case a compiler could emit asm that works like this:

     tmp = load(x);          // compile-time reordering: hoisted before the relaxed load
     if (load(ready_flag))
         actually_use(tmp);  // pseudocode: only use tmp if the flag was set
    

    It's data-race UB in C++ to read x while it might still be being written, but for most specific ISAs there's no problem with that. You just have to avoid actually using any load results that might be bogus.

    This might not be a useful optimization for most ISAs, but nothing rules it out. Hoisting the load to hide load latency on in-order pipelines might actually be useful sometimes (if x wasn't being written by another thread, and the compiler might guess that it isn't, because there's no acquire load).
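
To illustrate point 1, here is a sketch of the pointer-publication pattern where the data dependency does the ordering (names are illustrative; note that current compilers actually promote memory_order_consume to acquire):

    #include <atomic>

    struct Payload { int value; };
    std::atomic<Payload*> ptr{nullptr};

    // producer: fill in the struct, then publish the pointer
    void publish() {
        Payload* p = new Payload{100};
        ptr.store(p, std::memory_order_release);
    }

    // consumer: the dereference is data-dependent on the pointer load,
    // which most ISAs (but not DEC Alpha) order for free in hardware
    void consume_payload() {
        Payload* p;
        while (!(p = ptr.load(std::memory_order_consume))) {
            // spin until published
        }
        int v = p->value;  // data-dependent load
        (void)v;
    }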

By far your best bet is to use ready_flag.load(mo_acquire).


A separate problem is that you have commented-out code that reads x after the if(), which would run even if the load didn't see the data ready. As @Nicol explained in his answer, this means data-race UB is possible because you might be reading x while the producer is writing it.

Perhaps you wanted to write a spin-wait loop like while(!ready_flag){ _mm_pause(); }? Generally, be careful of wasting huge amounts of CPU time spinning; if the wait might be long, use a library-supported mechanism, like a condition variable, that gives you an efficient fallback to OS-supported sleep/wakeup (e.g. Linux futex) after spinning for a short time.
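
For example, a minimal spin-wait sketch (assuming x86 and _mm_pause from <immintrin.h>; a real program should fall back to blocking if the wait can be long):

    #include <atomic>
    #include <immintrin.h>  // _mm_pause, x86-specific

    std::atomic<bool> ready_flag{false};

    void wait_for_flag() {
        // the acquire load pairs with the producer's release store
        while (!ready_flag.load(std::memory_order_acquire)) {
            _mm_pause();  // be friendlier to the sibling hyperthread and save power
        }
        // data published before the release store is now safe to read
    }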


If you did want a manual barrier separate from the load, it would be

 if (ready_flag.load(mo_relaxed)) {
     atomic_thread_fence(mo_acquire);
     int tmp = x;   // now this is safe
 }
 // atomic_thread_fence(mo_acquire);  // here it still wouldn't make it safe to read x,
 // because this code runs even when ready_flag == false

Using if(ready_flag.load(mo_acquire)) would lead to an unconditional fence before branching on the ready_flag load when compiling for any ISA where an acquire load isn't available as a single instruction. (On x86 all loads are acquire loads; on AArch64, ldar does an acquire load; ARMv7 needs a plain load followed by dmb ish.)
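
In other words, on such an ISA, if(ready_flag.load(mo_acquire)) effectively behaves like this source (an illustrative sketch reusing the question's variables):

    #include <atomic>

    extern int x;
    extern std::atomic<bool> ready_flag;

    void reader() {
        bool is_ready = ready_flag.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);  // unconditional: paid even if the flag is false
        if (is_ready) {
            int tmp = x;  // safe to read the published data here
            (void)tmp;
        }
    }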

Peter Cordes
  • "_load a stale value that was written after_" that phrase is hard to parse – curiousguy Nov 11 '19 at 19:28
  • @PeterCordes, assume x is in thread-2's cache already. Based on your answer, I guess the memory-reordering sequence is like this: T1 updates x and sends an invalidate request to T2; T2 queues the message, then acks back to T1. Without a read fence, T2 will read the old value of x. If there is a read fence, T2 waits until the message in the invalidation queue is processed? So write-release cooperates with the store buffer, and read-acquire cooperates with the invalidation queue? – LongLT Nov 12 '19 at 04:20
  • @PeterCordes, btw thanks for your great answer and effort. I figured out that the compiler tends to load, then issue a fence, before doing the branch, like this: `bool is_ready = ready_flag.load(); __sync_synchronize(); if (is_ready) { }`. Anyway, that agrees exactly with your answer – LongLT Nov 12 '19 at 04:45
  • @LongLT: If `x` is already hot in T2's private/local cache, `load register, [x]` can just execute while `load reg, [ready_flag]` is waiting on a cache miss. So it takes a value for `x` from the coherent cache *before* it does that for `ready_flag`. This is LoadLoad reordering. T1 can't make a store to `x` globally visible (commit to L1d cache) until *after* it receives a response to its RFO request for exclusive ownership of that cache line. – Peter Cordes Nov 12 '19 at 05:00
  • @LongLT: You have invalidate queues backwards. No wonder I thought I'd never heard of something that worked that way (a queue for delayed responses to RFOs). [what is a store buffer?](//stackoverflow.com/q/11105827) explains invalidate queues: it's what a core uses to track / wait for *responses* to RFOs and invalidates it sent out, while waiting to commit a store from the store buffer into L1d. Normally you'll see a Read For Ownership, not just an invalidate, unless a core is doing a full-line write (e.g. x86 NT stores with write-combining before they become architecturally visible.) – Peter Cordes Nov 12 '19 at 05:04

The C++ standard doesn't specify the code generated by any particular construct; only correct combinations of thread communication tools produce a guaranteed result.

You don't get guarantees about the CPU from C++, because C++ is not a kind of (macro) assembly, not even a "high level assembly", at least not unless all objects have a volatile type.

Atomic objects are communication tools for exchanging data between threads. The correct use, for correct visibility of memory operations, is either a store with (at least) release ordering followed by a load with (at least) acquire ordering, or the same with RMW operations in between, or the store (resp. the load) replaced by an RMW with (at least) release (resp. acquire) ordering, or any of these variants with a relaxed operation plus a separate fence (a sketch of the RMW variant appears at the end of this answer).

In all cases:

  • the thread "publishing" the "done" flag must use a memory ordering of at least release (that is: release, release+acquire, or sequential consistency),
  • and the "subscribing" thread, the one acting on the flag, must use at least acquire (that is: acquire, release+acquire, or sequential consistency).

In practice with separately compiled code other modes might work, depending on the CPU.
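
As a sketch of one of those combinations, here the release store is replaced by an RMW with release ordering (all names are illustrative):

    #include <atomic>

    int payload = 0;
    std::atomic<int> done_count{0};

    // publishing thread: plain write, then an RMW with (at least) release ordering
    void publish() {
        payload = 42;
        done_count.fetch_add(1, std::memory_order_release);
    }

    // subscribing thread: (at least) acquire when acting on the flag
    void subscribe() {
        while (done_count.load(std::memory_order_acquire) == 0) {
            // spin
        }
        int v = payload;  // the write to payload is guaranteed visible
        (void)v;
    }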

curiousguy