
With reference to the following code:

auto x = std::atomic<std::uint64_t>{0};
auto y = std::atomic<std::uint64_t>{0};

// thread 1
x.store(1, std::memory_order_release);
auto one = y.load(std::memory_order_seq_cst);

// thread 2
y.fetch_add(1, std::memory_order_seq_cst);
auto two = x.load(std::memory_order_seq_cst);

Is it possible here, that one and two are both 0?


(I seem to be hitting a bug that could be explained if one and two could both hold the value of 0 after the code above runs. And the rules for ordering are too complicated for me to figure out what orderings are possible above.)

Curious
  • No, it's not. I'm deep enough in there to understand why not, but I don't get it well enough to be able to explain... – Daniël van den Berg May 25 '21 at 18:38
  • I do not see how it can happen. The only way for it to happen would be via reordering of two operations in those threads (i.e. setting `one` or `two` before `store` or `fetch_add`), but this can't happen due to memory ordering. And absent such reordering, at least one of them must be non-0. – SergeyA May 25 '21 at 19:00
  • @SergeyA: A release operation can reorder in one direction, even past later seq_cst loads. seq_cst isn't even guaranteed to be a superset of release, e.g. I seem to recall some ISA (or perhaps just a gap in the formalism, I forget) where an acquire load might not fully synchronize-with a seq_cst store, or something like that. (That was considered undesirable and something that could hopefully be fixed or ruled out at least in practice.) – Peter Cordes May 25 '21 at 19:17
  • @PeterCordes interesting! I am also reading your answer. – SergeyA May 25 '21 at 19:23
  • @SergeyA: Found the thing I was remembering: [P0668R5](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html) - a proposal to weaken the ISO C++ standard's wording in places (using "simply happens before") to account for the fact that there's a complicated 3-thread interaction with release and seq_cst that in theory (with the current C++->asm mappings and the PowerPC memory model) makes a result possible that the current C++ standard forbids. We still don't have to worry about SC loads directly syncing-with release stores, though; the test case is much more complex. – Peter Cordes May 25 '21 at 20:26

1 Answer


Yes, it's possible for both loads to get 0.

Within thread 1, y.load can "pass" (reorder with) x.store(mo_release) because they're not both seq_cst. The global total order of seq_cst operations that ISO C++ guarantees to exist includes only seq_cst operations, so it places no constraint on the release store.

(In terms of hardware / CPU architecture for a normal CPU: the load can take a value from coherent cache before the release store leaves the store buffer. In this case I found it much easier to reason in terms of how I know the code compiles for x86 (or to generic acquire and release operations), and then apply the asm memory-ordering rules. That reasoning assumes the usual C++-to-asm mappings are safe, i.e. always at least as strong as the C++ memory model requires. If you can find a legal reordering this way, you don't need to wade through the C++ formalism; but if you can't, that of course doesn't prove it's impossible in the C++ abstract machine.)

Anyway, the key point to realize is that a seq_cst operation isn't like atomic_thread_fence(mo_seq_cst): individual seq_cst operations only have to recover/maintain sequential consistency in the way they interact with other seq_cst operations, not with plain acquire/release/acq_rel operations. (Similarly, acquire and release fences are stronger two-way barriers, unlike acquire and release operations, as Jeff Preshing explains.)


The reordering that makes this happen

I renamed one and two to r1 and r2 (local "registers" within each thread), to avoid writing things like one == 0.

// the x=1 store nominally executes first in T1, but doesn't have to
// drain from the store buffer before T1's later load
auto r1 = y.load(std::memory_order_seq_cst);   // T1b             r1 = 0 (y)
         y.fetch_add(1, std::memory_order_seq_cst);      // T2a   y = 1 becomes globally visible
         auto r2 = x.load(std::memory_order_seq_cst);    // T2b   r2 = 0 (x)
x.store(1, std::memory_order_release);         // T1a             x = 1 eventually becomes globally visible

That's the only reordering possible; the other possibilities are just interleavings of the two threads' program order. Having the store become globally visible last is what produces the 0, 0 result.

This can happen in practice on x86, but interestingly not on AArch64. x86 can do a release store with no extra barriers (it's just a normal store), and a seq_cst load compiles the same as a plain acquire load: also just a normal load.

On AArch64, release and seq_cst stores use STLR. seq_cst loads use LDAR, which has a special interaction with STLR, not being allowed to read cache until the last STLR drains from the store buffer. So release-store / seq_cst load on ARMv8 is the same as seq_cst store / seq_cst load. (ARMv8.3 added LDAPR, allowing true acquire / release by letting acquire loads compile differently; see this Q&A.)

However, it can also happen on many ISAs that use separate barrier instructions, like ARM32: a release store is typically done as a barrier followed by a plain store, which prevents reordering with earlier loads / stores but does nothing to stop reordering with later ones. If the seq_cst load doesn't need a full barrier before itself (the normal case), the store can reorder after the load.

For example, a release store on ARMv7 is dmb ish; str, and a seq_cst load is ldr; dmb ish, so you have str / ldr with no barrier between them.

On PowerPC, a seq_cst load is hwsync; ld; cmp; bc; isync, so there's a full barrier before the load. (The heavyweight sync is, I think, part of preventing IRIW reordering: it blocks store-forwarding between SMT threads on the same physical core, so stores from other cores are only seen once they actually become globally visible.)

Peter Cordes
  • Wow, I loved reading this answer! Always love your posts. I have a question though -- till now I had assumed seq_cst was strictly a superset of similar release/acquire operations. Is this basically an example where replacing seq_cst with release/acquire will make it so that this reordering is impossible? – Curious May 25 '21 at 19:51
  • Actually, it seems both loads returning 0 is still possible with acquire/release. (right?) – Curious May 25 '21 at 19:56
  • @Curious: yes, thread 1 using release and acquire instead of release and seq_cst makes it "obvious" that this reordering is allowed. https://preshing.com/20120913/acquire-and-release-semantics/ shows their semantics: reordering is allowed in one direction. – Peter Cordes May 25 '21 at 20:01
  • @Curious: Yes, seq_cst *is* basically a superset of acq_rel (or acquire or release). In the standard, a seq_cst load can sync-with a release-store, for example. The actual asm may use *different* barriers instead of just *more* barriers. Apparently there's a possible problem with the standard asm mapping on PowerPC, though: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html has some details. It involves a complicated interaction between three threads mixing a release and SC on two variables, and depending on the global total order of SC operations as well as syncs-with. – Peter Cordes May 25 '21 at 20:20
  • Would it be possible to use an atomic_thread_fence(seq_cst) to prevent this reordering? I only know how to use atomic_thread_fence(acq/rel) and that sounds like it would not work here because we want to force the release store in thread 1 from moving past the load. But a release fence only guarantees synchronization when used before an atomic write. – Curious May 26 '21 at 17:43
  • Yes, a seq_cst thread fence would keep the store ahead of the load in thread 1. (I'm not sure how the C++ formalism describes it, but in practice they always compile to a full barrier instruction, which includes a StoreLoad barrier, unlike a release fence. See [Preshing's diagram](https://preshing.com/20130922/acquire-and-release-fences/). I'm not 100% sure that weaker operations + seq_cst barriers is fully guaranteed in the ISO C++ abstract machine to recover sequential consistency the way it does in practice on real machines, but I think / assume so.) – Peter Cordes May 26 '21 at 17:49
  • Thanks! On a related note, could you actually point me to some documentation that I can use to understand how atomic_thread_fence(seq_cst) actually works? I've been trying to google this but the only good source I can find is the official std documentation (or the cppreference equivalent), but those are too hairy for me to follow :/ (Since it compiles down to just mfence on x86, I guess understandable documentation for that would be sufficient also?) – Curious May 26 '21 at 20:50
  • @Curious: In terms of reordering a thread's accesses to coherent cache, `atomic_thread_fence(mo_seq_cst)` is a full barrier, blocking all loads and stores from crossing it in either direction. (This implies draining the store buffer before any later loads/stores). Certainly in practice on most, maybe all, machines; e.g. `mfence` or `lock or byte ptr[rsp], 0` on x86, `hwsync` on Power, `dsb ish` on ARMv7 / 8. If you want to understand how it works in the C++ formalism, in terms of guarantees of a total order existing, and of creating happens-before relations, you need to read the standard. – Peter Cordes May 26 '21 at 22:31
  • A variant of the reordering in this question: a seq-cst store followed by an acquire load. Does C++ allow this load to be reordered ahead of this store? I would guess yes but I am not so confident. – zanmato Dec 07 '21 at 04:40
  • @zanmato: yup, that reordering is allowed. Easiest way to be sure is that it can compile to AArch64 `stlr` (seq_cst) / `ldapr` (acquire). Only `ldar` (seq_cst) has a special interaction with `stlr` that blocks StoreLoad reordering with `stlr`. It's generally safe to assume that the C++ -> asm mappings are at least as strong as what ISO C++ requires (i.e. not buggy), although at least one really obscure corner case has been discovered for PowerPC, involving a less-than-SC RMW on an seq_cst store or something, and an acquire or seq_cst load by a third thread... The basics are fine, though. – Peter Cordes Dec 07 '21 at 04:56
  • And I don't think x86 can produce this reordering (but it can produce the reordering in the original question). So this variant makes a nice example of hardware not able to take the room that C++ standard spares for possible optimization? – zanmato Dec 07 '21 at 05:33
  • @zanmato: correct; x86 is like most ISAs where seq_cst store is done with a full barrier as part of or after the store, preventing reordering with anything after. – Peter Cordes Dec 07 '21 at 07:01