
With reference to the following code:

auto x = std::atomic<std::uint64_t>{0};
auto y = std::atomic<std::uint64_t>{0};

// thread 1
x.store(1, std::memory_order_release);
auto one = y.load(std::memory_order_seq_cst);

// thread 2
y.fetch_add(1, std::memory_order_seq_cst);
auto two = x.load(std::memory_order_seq_cst);

Is it possible here, that one and two are both 0?


(I seem to be hitting a bug that could be explained if one and two could both hold the value of 0 after the code above runs. And the rules for ordering are too complicated for me to figure out what orderings are possible above.)

Curious
  • No, it's not. I'm deep enough in there to understand why not, but I don't get it well enough to be able to explain... – Daniël van den Berg May 25 '21 at 18:38
  • I do not see how it can happen. The only way for it to happen would be via reordering of two operations in those threads (i.e. setting `one` or `two` before `store` or `fetch_add`), but this can't happen due to memory ordering. And absent such reordering, at least one of them must be non-0. – SergeyA May 25 '21 at 19:00
  • @SergeyA: A release operation can reorder in one direction, even past later seq_cst loads. seq_cst isn't even guaranteed to be a superset of release, e.g. I seem to recall some ISA (or perhaps just a gap in the formalism, I forget) where an acquire load might not fully synchronize-with a seq_cst store, or something like that. (That was considered undesirable and something that could hopefully be fixed or ruled out at least in practice.) – Peter Cordes May 25 '21 at 19:17
  • @PeterCordes interesting! I am also reading your answer. – SergeyA May 25 '21 at 19:23
  • @SergeyA: Found the thing I was remembering: [P0668R5](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html) - a proposal to weaken the ISO C++ standard's wording in places (using "simply happens before") to account for the fact that there's a complicated 3-thread interaction with release and seq_cst that in theory (with the current C++->asm mappings and the PowerPC memory model) makes a result possible that the current C++ standard forbids. We still don't have to worry about SC loads directly syncing-with release stores, though; the test case is much more complex. – Peter Cordes May 25 '21 at 20:26

1 Answer


Yes, it's possible for both loads to get 0.

Within thread 1, y.load can "pass" (reorder with) x.store(mo_release) because they're not both seq_cst. The global total order of seq_cst operations that ISO C++ guarantees to exist includes only seq_cst operations, so it places no constraint on the release store.

(In terms of hardware / CPU architecture for a normal CPU: the load can take a value from coherent cache before the release store leaves the store buffer. In this case I found it much easier to reason in terms of how I know the code compiles for x86 (or to generic acquire and release operations), and then apply the asm memory-ordering rules. That reasoning assumes the usual C++-to-asm mappings are safe, i.e. always at least as strong as the C++ memory model requires. If you can find a legal reordering this way, you don't need to wade through the C++ formalism; but if you can't, that of course doesn't prove it's impossible in the C++ abstract machine.)

Anyway, the key point to realize is that a seq_cst operation isn't like atomic_thread_fence(mo_seq_cst): individual seq_cst operations only have to recover/maintain sequential consistency in the way they interact with other seq_cst operations, not with plain acquire/release/acq_rel operations. (Similarly, acquire and release fences are stronger two-way barriers, unlike acquire and release operations, as Jeff Preshing explains.)


The reordering that makes this happen

I renamed one and two to r1 and r2 (local "registers" within each thread), to avoid writing things like one == 0.

// the x=1 store nominally executes first in T1, but doesn't have to
// drain from the store buffer before T1's later load
auto r1 = y.load(std::memory_order_seq_cst);   // T1b             r1 = 0 (y)
         y.fetch_add(1, std::memory_order_seq_cst);      // T2a   y = 1 becomes globally visible
         auto r2 = x.load(std::memory_order_seq_cst);    // T2b   r2 = 0 (x)
x.store(1, std::memory_order_release);         // T1a             x = 1 eventually becomes globally visible

That's the only reordering possible; the other possibilities are just interleavings of the two threads' program order. Having the store become globally visible last is what produces the 0, 0 result.

This can happen in practice on x86, but interestingly not on AArch64. x86 can do a release store with no extra barriers (it's just a normal store), and a seq_cst load compiles the same as a plain acquire load: also just a normal load.

On AArch64, release and seq_cst stores use STLR. seq_cst loads use LDAR, which has a special interaction with STLR, not being allowed to read cache until the last STLR drains from the store buffer. So release-store / seq_cst load on ARMv8 is the same as seq_cst store / seq_cst load. (ARMv8.3 added LDAPR, allowing true acquire / release by letting acquire loads compile differently; see this Q&A.)

However, it can also happen on many ISAs that use separate barrier instructions, like ARM32: a release store is typically done as a barrier followed by a plain store, which prevents reordering with earlier loads / stores but does nothing to stop reordering with later ones. If the seq_cst load doesn't need a full barrier before itself (the normal case), the store can reorder after the load.

For example, a release store on ARMv7 is dmb ish; str, and a seq_cst load is ldr; dmb ish, so you have str / ldr with no barrier between them.

On PowerPC, a seq_cst load is hwsync; ld; cmp; bc; isync, so there's a full barrier before the load. (The heavyweight sync is, I think, part of preventing IRIW reordering: it blocks store-forwarding between SMT threads on the same physical core, so stores from other cores are only seen once they actually become globally visible.)

Peter Cordes
  • Wow, I loved reading this answer! Always love your posts. I have a question though -- till now I had assumed seq_cst was strictly a superset of similar release/acquire operations. Is this basically an example where replacing seq_cst with release/acquire will make it so that this reordering is impossible? – Curious May 25 '21 at 19:51
  • Actually, it seems both loads returning 0 is still possible with acquire/release. (right?) – Curious May 25 '21 at 19:56
  • @Curious: yes, thread 1 using release and acquire instead of release and seq_cst makes it "obvious" that this reordering is allowed. https://preshing.com/20120913/acquire-and-release-semantics/ shows their semantics: reordering is allowed in one direction. – Peter Cordes May 25 '21 at 20:01
  • @Curious: Yes, seq_cst *is* basically a superset of acq_rel (or acquire or release). In the standard, a seq_cst load can sync-with a release-store, for example. The actual asm may use *different* barriers instead of just *more* barriers. Apparently there's a possible problem with the standard asm mapping on PowerPC, though: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html has some details. It involves a complicated interaction between three threads mixing a release and SC on two variables, and depending on the global total order of SC operations as well as syncs-with. – Peter Cordes May 25 '21 at 20:20
  • Would it be possible to use an atomic_thread_fence(seq_cst) to prevent this reordering? I only know how to use atomic_thread_fence(acq/rel) and that sounds like it would not work here because we want to force the release store in thread 1 from moving past the load. But a release fence only guarantees synchronization when used before an atomic write. – Curious May 26 '21 at 17:43
  • Yes, a seq_cst thread fence would keep the store ahead of the load in thread 1. (I'm not sure how the C++ formalism describes it, but in practice they always compile to a full barrier instruction, which includes a StoreLoad barrier, unlike a release fence. See [Preshing's diagram](https://preshing.com/20130922/acquire-and-release-fences/). I'm not 100% sure that weaker operations + seq_cst barriers is fully guaranteed in the ISO C++ abstract machine to recover sequential consistency the way it does in practice on real machines, but I think / assume so.) – Peter Cordes May 26 '21 at 17:49
  • Thanks! On a related note, could you actually point me to some documentation that I can use to understand how atomic_thread_fence(seq_cst) actually works? I've been trying to google this but the only good source I can find is the official std documentation (or the cppreference equivalent), but those are too hairy for me to follow :/ (Since it compiles down to just mfence on x86, I guess understandable documentation for that would be sufficient also?) – Curious May 26 '21 at 20:50
  • @Curious: In terms of reordering a thread's accesses to coherent cache, `atomic_thread_fence(mo_seq_cst)` is a full barrier, blocking all loads and stores from crossing it in either direction. (This implies draining the store buffer before any later loads/stores). Certainly in practice on most, maybe all, machines; e.g. `mfence` or `lock or byte ptr[rsp], 0` on x86, `hwsync` on Power, `dsb ish` on ARMv7 / 8. If you want to understand how it works in the C++ formalism, in terms of guarantees of a total order existing, and of creating happens-before relations, you need to read the standard. – Peter Cordes May 26 '21 at 22:31
  • A variant of the reordering in this question: a seq-cst store followed by an acquire load. Does C++ allow this load to be reordered ahead of this store? I would guess yes but I am not so confident. – zanmato Dec 07 '21 at 04:40
  • @zanmato: yup, that reordering is allowed. Easiest way to be sure is that it can compile to AArch64 `stlr` (seq_cst) / `ldapr` (acquire). Only `ldar` (seq_cst) has a special interaction with `stlr` that blocks StoreLoad reordering with `stlr`. It's generally safe to assume that the C++ -> asm mappings are at least as strong as what ISO C++ requires (i.e. not buggy), although at least one really obscure corner case has been discovered for PowerPC, involving a less-than-SC RMW on an seq_cst store or something, and an acquire or seq_cst load by a third thread... The basics are fine, though. – Peter Cordes Dec 07 '21 at 04:56
  • And I don't think x86 can produce this reordering (but it can produce the reordering in the original question). So this variant makes a nice example of hardware not able to take the room that C++ standard spares for possible optimization? – zanmato Dec 07 '21 at 05:33
  • @zanmato: correct; x86 is like most ISAs where seq_cst store is done with a full barrier as part of or after the store, preventing reordering with anything after. – Peter Cordes Dec 07 '21 at 07:01