59

For both memory_order_seq_cst and memory_order_acq_rel, stores are release operations and loads are acquire operations. I know that memory_order_seq_cst is meant to additionally impose a single total order on all such operations, but I'm failing to build an example where that matters, i.e. where the behaviour changes if every memory_order_seq_cst is replaced by memory_order_acq_rel.

Am I missing something, or is the difference just a matter of documentation, i.e. one should use memory_order_seq_cst when not intending to play with a more relaxed model, and memory_order_acq_rel when constraining a relaxed model?

curiousguy
AProgrammer

4 Answers

55

http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

std::atomic<bool> x{false};
std::atomic<bool> y{false};
std::atomic<int> z{0};

void a() { x = true; }
void b() { y = true; }
void c() { while (!x); if (y) z++; }
void d() { while (!y); if (x) z++; }

// kick off a, b, c, d, join all threads
assert(z != 0);

Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.
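
For concreteness, here is a complete, runnable sketch of the acquire/release variant (my own fleshing-out of the example above; the function names are just illustrative). The C++ memory model allows the assert to fire, because the two reader threads may observe the two independent stores in opposite orders. In practice you are unlikely to see it on strongly-ordered x86, but it is observable on weakly-ordered hardware such as POWER or ARM:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x{false};
std::atomic<bool> y{false};
std::atomic<int> z{0};

void write_x() { x.store(true, std::memory_order_release); }
void write_y() { y.store(true, std::memory_order_release); }

void read_x_then_y() {
    while (!x.load(std::memory_order_acquire)) {}   // wait until x is set
    if (y.load(std::memory_order_acquire)) z++;     // may still observe y == false
}

void read_y_then_x() {
    while (!y.load(std::memory_order_acquire)) {}   // wait until y is set
    if (x.load(std::memory_order_acquire)) z++;     // may still observe x == false
}

int main() {
    std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0);  // allowed to fail with acquire/release; never fails with seq_cst
}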

Drew Dormann
MSN
  • I don't understand why `x=true;y=true;c();d()` isn't possible? That should cause it to be 0. Also I don't know why i get 2 a lot as the results. –  Sep 10 '12 at 09:49
  • @acidzombie24, even in that case, `z` will be 2. – MSN Sep 10 '12 at 20:40
  • I messed up, i misread the code. That makes perfect sense now –  Sep 11 '12 at 08:53
  • @MSN I don't understand this example. 1. don't while(!x) and while(!y) guarantees that either if(y) or if(x) returns true? even we use per atomic variable ordering? 2. how does global ordering help? (this may come clear if i understand 1). thx. – Candy Chiu Jan 05 '16 at 16:36
  • @CandyChiu With acq_rel, `c()` can perceive that `x=true;` in `a()` happens before `y=true;` in `b()` at the same time `d()` can perceive that `y=true;` happens before `x=true;` (due to lack of "global ordering".) In particular `c()` can perceive `x==true` and `y==false` at the same time `d()` can perceive `y==true` and `x==false`. So `z` might not be incremented by either of `c()` or `d()`. With seq_cst, if `c()` perceives `x=true;` happens before `y=true;`, so does `d()`. – nodakai Jan 20 '16 at 11:03
  • @MSN You meant `int z=0`, not `bool z=0` – nodakai Jan 20 '16 at 11:04
  • @nodakai, Your explanation is accurate but I think the phrase "happens before" can be misleading since the crux of the issue with acquire-release is that neither of the writes _happens-before_ the other. – jhoffman0x May 01 '18 at 03:50
  • This example is using pure loads and pure stores, not any actual RMW operations that could use `std::memory_order_acq_rel`. In an atomic read-modify-write, the load and store are tied together because they're an atomic. I'm not sure when if ever `acq_rel` can differ from `seq_cst` for something like `.fetch_add` or `.compare_exchange_weak` – Peter Cordes Nov 02 '19 at 06:32
  • @PeterCordes If you only use RMW in a program, then obviously acq + rel => seq_cst. – curiousguy Nov 28 '19 at 00:45
  • @curiousguy: I'm not convinced that's true. seq_cst requires that all threads agree on a total order; hardware that isn't multi-copy-atomic (e.g. POWER) can use a weaker barrier in acq_rel RMWs vs. seq_cst RMWs (https://godbolt.org/z/C4MuhU lwsync instead of sync before the RMW), possibly leading to some threads disagreeing about the relative order of things. i.e. the IRIW experiment but with RMWs instead of pure stores. I *think* acq_rel RMWs could still work as observers without making the delay between 1st and 2nd load so long that you can't see the effect. Or failed-CAS is just a load :P – Peter Cordes Nov 28 '19 at 03:12
  • @MSN does sequential consistency force threads to start in that specific order (that is a, b, c, d)? – Ignorant Dec 27 '19 at 20:28
  • @Ignorant Sequential consistency means that all those operations that have that guarantee (seq_cst ops) are done as if done in some global order, as if the operations by diff threads were interlaced. On a mono CPU system, asm level operations are always so. – curiousguy May 02 '20 at 04:10
  • Why I cannot trigger the `assert(z!=0)` on my intel x86 platform? – Jay Zhuang Sep 26 '20 at 18:21
  • @curiousguy are you saying a `seq_cst` operation guarantees that compiler and CPU cannot reorder any store and load happens before and after that `seq_cst` regardless whether those stores and loads are related or depending on the `seq_cst` operation? And `acq_rel` operation can synchronize with any `acquire` operation on that same memory address but nothing more? – HCSF Nov 25 '21 at 02:06
  • @JayZhuang See https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/ . On Intel X86: "In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility)." In other words, on Intel, it is not possible, that the cores with threads c() and d() see a different order of the writes. – Kai Petzke Mar 28 '23 at 16:06
15

On ISAs like x86 where atomics map to barriers, and the actual machine model includes a store buffer:

  • seq_cst stores require flushing the store buffer so this thread's later reads are delayed until after the store is globally visible.

  • acquire or release do not have to flush the store buffer. Normal x86 loads and stores have essentially acq and rel semantics. (seq_cst plus a store buffer with store forwarding.)

    But x86 atomic RMW operations always get promoted to seq_cst because the x86 asm lock prefix is a full memory barrier. Other ISAs can do relaxed or acq_rel RMWs in asm, with the store side being able to do limited reordering with later stores. (But not in ways that would make the RMW appear non-atomic: For purposes of ordering, is atomic read-modify-write one operation or two?)
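
For instance (my sketch, not from the answer): on x86 all of the following compile to the same lock add, regardless of the requested ordering, which is why RMWs are effectively seq_cst there; other ISAs can use weaker barriers for the relaxed and acq_rel versions:

#include <atomic>

std::atomic<int> counter{0};

void add_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }  // x86: lock add
void add_acq_rel() { counter.fetch_add(1, std::memory_order_acq_rel); }  // x86: lock add
void add_seq_cst() { counter.fetch_add(1, std::memory_order_seq_cst); }  // x86: lock add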


https://preshing.com/20120515/memory-reordering-caught-in-the-act is an instructive example of the difference between a seq_cst store and a plain release store. (It's actually mov + mfence vs. plain mov in x86 asm. In practice xchg is a more efficient way to do a seq_cst store on most x86 CPUs, but GCC does use mov+mfence)
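
A quick way to see that difference yourself (my sketch, not part of the original answer): compile something like the following for x86-64 and compare the output, e.g. on godbolt. The exact instructions depend on the compiler and version, but GCC typically emits mov + mfence for the seq_cst store and a plain mov for the release store (clang typically uses xchg for the seq_cst store):

#include <atomic>

std::atomic<int> g{0};

void store_seq_cst(int v) {
    g.store(v, std::memory_order_seq_cst);  // x86-64 GCC: mov + mfence (clang: xchg) -- drains the store buffer
}

void store_release(int v) {
    g.store(v, std::memory_order_release);  // x86-64: plain mov -- release ordering is free on this hardware
}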


Fun fact: AArch64's LDAR acquire-load instruction is actually a sequential-acquire, having a special interaction with STLR. Not until ARMv8.3 LDAPR can arm64 do plain acquire operations that can reorder with earlier release and seq_cst stores (STLR). (seq_cst loads still use LDAR because they need that interaction with STLR to recover sequential consistency; seq_cst and release stores both use STLR).

With STLR / LDAR you get sequential consistency, but only having to drain the store buffer before the next LDAR, not right away after each seq_cst store before other operations. I think real AArch64 HW does implement it this way, rather than simply draining the store buffer before committing an STLR.

Strengthening rel or acq_rel to seq_cst by using LDAR / STLR doesn't need to be expensive, unless you seq_cst store something, and then seq_cst load something else. Then it's just as bad as x86.
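
A sketch of that expensive case (my addition; the instruction mnemonics in the comments are the typical AArch64 codegen and will vary with compiler and target options):

#include <atomic>

std::atomic<int> s1{0}, s2{0};

int seq_cst_store_then_load() {
    s1.store(1, std::memory_order_seq_cst);   // AArch64: stlr
    // AArch64: ldar -- has to wait for the earlier stlr to drain from the
    // store buffer, so this pairing costs about as much as it does on x86.
    return s2.load(std::memory_order_seq_cst);
}

int release_store_then_acquire_load() {
    s1.store(1, std::memory_order_release);   // AArch64: also stlr
    // Plain acquire: when targeting ARMv8.3+ this can be ldapr, which may
    // complete without waiting for the earlier stlr; before 8.3, compilers
    // have to use ldar here and give up that reordering.
    return s2.load(std::memory_order_acquire);
}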

Some other ISAs (like PowerPC) have more choices of barriers and can strengthen up to mo_rel or mo_acq_rel more cheaply than mo_seq_cst, but their seq_cst can't be as cheap as AArch64's; seq_cst stores need a full barrier.

So AArch64 is an exception to the rule that seq_cst stores drain the store buffer on the spot, either with a special instruction or a barrier instruction after. It's not a coincidence that ARMv8 was designed after C++11 / Java / etc. basically settled on seq_cst being the default for lockless atomic operations, so making them efficient was important. And after CPU architects had a few years to think about alternatives to providing barrier instructions or just acquire/release vs. relaxed load/store instructions.

Peter Cordes
  • "_But x86 atomic RMW operations always get promoted to seq_cst because the x86 asm lock prefix is a full memory barrier._" What makes you say they are "promoted"? Also the exec could well speculatively load the value (normally) and do the computation as long as it reloads it safely (locked load) later; if the computation is fast that's probably uninteresting but still possible. (I suppose that these things are documented in a purely descriptive way by Intel for existing designs and not for future ones.) – curiousguy Nov 28 '19 at 01:01
  • @curiousguy: the full-memory-barrier nature of the x86 `lock` prefix is carefully documented by Intel and AMD in their x86 ISA manuals. ([Does lock xchg have the same behavior as mfence?](//stackoverflow.com/q/40409297)). It's definitely guaranteed for future x86 CPUs; how else could compilers make safe future-proof asm? This is what I mean by compilers having to strengthen all RMW operations to seq_cst in the asm, draining the store buffer before the RMW does its thing. – Peter Cordes Nov 28 '19 at 03:03
  • What is guaranteed exactly? That the CPU will not try to get the value already loaded and the computation ready in memory in advance, so speed up a costly RMW, says `xdiv` (or `xcos` if the FPU decides to support RMW)? – curiousguy Nov 30 '19 at 04:53
  • @curiousguy: what? There are no "expensive" atomic RMW instructions. Not even integer multiply. Only cheap basic ALU stuff like `lock add` and `lock or`, and bitstring instructions like `lock bts` (bit test-and-set). And special instructions intended for use with `lock`, like `cmpxchg` and `xadd`. But they're still all easy for the ALU to implement the computation part in a single cycle. – Peter Cordes Nov 30 '19 at 05:17
  • @curiousguy: But anyway, if a hypothetical implementation wanted to try loading early to set up for a cheaper atomic exchange to actually implement the RMW, it could only do that *speculatively* and roll back on mis-speculation (if the line changed before the load was architecturally allowed). Regular loads already work this way, to get performance while preserving strong load ordering. (See the `machine_clears.memory_ordering` performance counter: [Why flush the pipeline for Memory Order Violation caused by other logical processors?](//stackoverflow.com/q/55563077)) – Peter Cordes Nov 30 '19 at 05:18
  • @curiousguy: See also [What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?](//stackoverflow.com/a/45610386), especially comments on the top answer exploring the `machine_clears.memory_ordering` explanation for the observed performance. – Peter Cordes Nov 30 '19 at 05:23
  • "_it could only do that speculatively and roll back (...). Regular loads already work this way_" So it would only have to do what it already does on regular operations in order to get acq and rel semantics. So it's doable in principle and guarantees acq + rel. And if in another universe Intel has the same CPU but w/ support of `xdiv`, they can start doing the (potentially useful) optimization w/o breaking code. The only reason why these locked instr don't do it is because a trivial operation is done after the load. – curiousguy Nov 30 '19 at 17:33
  • The point being: `xdiv` can be implemented like that because it's the normal acq + rel semantics that are automatically enforced and it would even be easy to translate it to uops. And there is no "RMW operations always get promoted to seq_cst" because we just have done an atomic mutation w/ acq + rel as usual. Also the concept of a CPU instr being C/C++ seq_cst isn't a concept at all. **It's nonsense: seq_cst doesn't qualify any operation in particular** unlike separately compiled acq and rel. – curiousguy Nov 30 '19 at 17:54
  • @PeterCordes - I don't even think it's hypothetical: I think that's how atomic operations are (sometimes) implemented on current Intel x86. That is, that they load the cache line in an optimistic locked state, do the "front end" of the RMW (including the ALU op), and then in the "back end" of the RMW they verify everything was OK in the execute-at-retire op that ensures all the ordering. This works great when the location is not contended. If this fails a lot, a predictor will switch modes to doing the whole thing at retire, which causes a bigger bubble in the pipeline (hence "sometimes"). – BeeOnRope Nov 30 '19 at 18:16
  • Not that this contradicts anything you are saying: this is purely an optimization of the implementation, and the documented semantics are (mostly) maintained. I say "mostly" because I think as part of this overall locked-insn optimization process, they broke some ordering stuff like the weakly ordered loads thing, and while they reverted the `mfence` behavior at a cost in performance (so it now fences them), they didn't for locked. It's not clear if they'll basically just change the rules so that reordering is allowed or what... – BeeOnRope Nov 30 '19 at 18:19
  • @BeeOnRope: Oh interesting, I hadn't realized they could be that efficient in the uncontended case. – Peter Cordes Nov 30 '19 at 18:46
  • @curiousguy: When we say that an asm sequence provides seq_cst, we mean the sequence used in asm to implement C++ `mo_relaxed` CAS is the same sequence used to implement `mo_seq_cst` CAS. So if you do that everywhere, you get sequentially-consistent execution. We *also* mean that if programming by hand in asm, using full barriers that way is strong enough to recover sequential consistency on top of a weaker machine model. C++ has a handy abbreviation for that so we use it. But anyway, yes, speculative execution allows tricks. – Peter Cordes Nov 30 '19 at 18:52
  • @PeterCordes Note that mutexes are only acq-rel. – curiousguy Nov 30 '19 at 19:59
  • So a standalone atomic_thread_fence(memory_order_acq_rel) on x86 doesn't prevent StoreLoad reordering? – ilstam May 09 '20 at 23:27
  • @ilstam: Interesting question, I wasn't sure so I checked: https://godbolt.org/z/mRCZme GCC compiles it to no asm, so that's correct; it doesn't include a StoreLoad barrier. Same for PowerPC where it compiles to `lwsync` (light weight sync, not full `sync`). – Peter Cordes May 10 '20 at 00:38
  • "_Other ISAs can do relaxed or acq_rel RMWs in asm, with the store side being able to do limited reordering with later stores_" - shouldn't it rather be "with later _loads_"? Because the only ordering not enforced by acq_rel is StoreLoad. – Daniel Nitzan May 12 '23 at 16:38
  • @DanielNitzan: An RMW is actually a load and a store; no operation to the *same* location can come between them (giving RMW atomicity), but ops on other locations can actually overlap. The store half has release semantics, and in some cases actually *can* reorder with a later *store* to another location. (Or with a later load, but the Q&A I linked shows an experiment using two stores.) – Peter Cordes May 12 '23 at 16:48
  • @DanielNitzan: x86 doesn't have this effect; its RMWs are full memory barriers, so the load+store remain fully stuck together in the globally-visible order of the core's operations. (This is somewhat required by the LoadLoad and StoreStore ordering guarantees. If acq_rel RMWs existed on x86, the store side could reorder with later loads but not stores. But that's due to later stores being release ops themselves, the fact that x86 in general only allows StoreLoad reordering. Same for earlier stores but not loads reordering with the load side of a hypothetical acq_rel RMW on x86.) – Peter Cordes May 12 '23 at 16:52
5

Try to build Dekker's or Peterson's algorithm with just acquire/release semantics.

That won't work, because acquire/release semantics do not provide a [StoreLoad] fence.

In the case of Dekker's algorithm:

flag[self] = 1;                 // <-- STORE
while (true) {
    if (flag[other] == 0) {     // <-- LOAD
        break;
    }
    flag[self] = 0;
    while (turn == other);
    flag[self] = 1;
}

Without a [StoreLoad] fence, the load of flag[other] can be reordered ahead of the store to flag[self] (equivalently, the store can sit in the store buffer past the load), and then the algorithm breaks: two threads can simultaneously see that the other's flag is still 0, set their own flag and continue. And now you have two threads inside the critical section.
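
For completeness, here is a minimal Peterson-style lock in C++ (my own sketch, not from the answer; the names are illustrative). The seq_cst store/load pair is exactly what supplies the missing StoreLoad ordering; weaken the flag store and the peer-flag load to release/acquire and both threads can end up in the critical section at once:

#include <atomic>

std::atomic<bool> flag[2] = {{false}, {false}};  // flag[i]: thread i wants to enter
std::atomic<int> turn{0};

void lock(int self) {
    int other = 1 - self;
    flag[self].store(true, std::memory_order_seq_cst);  // STORE: announce intent
    turn.store(other, std::memory_order_seq_cst);       // give the other thread priority
    // LOAD: with seq_cst this load cannot appear to happen before the stores
    // above -- the StoreLoad ordering that acquire/release alone does not give.
    while (flag[other].load(std::memory_order_seq_cst) &&
           turn.load(std::memory_order_seq_cst) == other) {
        // spin until the other thread leaves or yields its turn
    }
}

void unlock(int self) {
    flag[self].store(false, std::memory_order_seq_cst);  // kept seq_cst for simplicity
}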

pveentjer
  • *"Without [StoreLoad] fence the store could jump in front of the load"* In your code, you've marked a STORE that is already in front of a LOAD. Did you mean to say that the LOAD might be executed before the STORE? – dyp Dec 14 '22 at 09:20
  • You are right. The load can jump in front of the store. – pveentjer Dec 14 '22 at 11:15
3

Take the definition and example from memory_order, but replace memory_order_seq_cst with memory_order_release in the stores and memory_order_acquire in the loads.

Release-acquire ordering guarantees that everything which happened before a store in one thread becomes a visible side effect in the thread that loads the stored value. But in our example, nothing happens before the stores in thread0 and thread1.

x.store(true, std::memory_order_release); // thread0

y.store(true, std::memory_order_release); // thread1

Furthermore, without memory_order_seq_cst, thread2 and thread3 are not required to agree on a single order in which the two stores became visible. The acquire loads cannot literally be reordered within each thread, but the observable outcome is as if they were:

if (y.load(std::memory_order_acquire)) { ++z; } // thread2, load y first
while (!x.load(std::memory_order_acquire)); // and then, load x

if (x.load(std::memory_order_acquire)) { ++z; } // thread3, load x first
while (!y.load(std::memory_order_acquire)); // and then, load y

So if, in this as-if view, thread2 and thread3 evaluate their if-conditions before thread0 and thread1 have stored, both x and y are still seen as false, ++z is never executed, z stays 0, and the assert fires.

However, when memory_order_seq_cst enters the picture, it establishes a single total order over all atomic operations that are so tagged. Thus thread2 and thread3 observe the stores to x and y in the same order: it cannot happen that thread2 sees y as still false while thread3 sees x as still false, so at least one of them increments z.

jun.wu