4

First, I want to list some of my understandings regarding this; please correct me if I'm wrong.

  1. An MFENCE on x86 ensures a full memory barrier.
  2. Sequential consistency prevents STORE-STORE, STORE-LOAD, LOAD-STORE and LOAD-LOAD reordering.

    This is according to Wikipedia.

  3. std::memory_order_seq_cst makes no guarantee to prevent STORE-LOAD reorder.

    This is according to Alex's answer: "Loads May Be Reordered with Earlier Stores to Different Locations" (for x86), and mfence will not always be added.

    Does std::memory_order_seq_cst indicate sequential consistency? According to points 2 and 3, it seems it does not. To me, std::memory_order_seq_cst indicates sequential consistency only when:

    1. at least one explicit MFENCE added to either LOAD or STORE
    2. LOAD (without fence) and LOCK XCHG
    3. LOCK XADD ( 0 ) and STORE (without fence)

    otherwise reordering is still possible.

    According to @LWimsey's comment, I made a mistake here: if both the LOAD and the STORE are memory_order_seq_cst, there is no reordering (see the litmus-test sketch after this list). Alex may have meant situations where non-atomic or non-SC operations are used.

  4. std::atomic_thread_fence(memory_order_seq_cst) always generates a full-barrier

    This is according to Alex's answer. So I can always replace asm volatile("mfence" ::: "memory") with std::atomic_thread_fence(memory_order_seq_cst)

    This is quite strange to me, because memory_order_seq_cst seems to have quite a different meaning for atomic operations than it does for fence functions.
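
To make the corrected point 3 concrete, here is a minimal Dekker-style litmus test (example code is mine, not taken from any answer; names are made up):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void t2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // With seq_cst on every operation, r1 == 0 && r2 == 0 is impossible:
    // neither thread's STORE-LOAD pair may appear reordered to the other.
    // Weaken the operations to release/acquire and that outcome is allowed on x86.
}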

Now I come to this code in a header file of MSVC 2015's standard library, which implements std::atomic_thread_fence:

inline void _Atomic_thread_fence(memory_order _Order)
    {   /* force memory visibility and inhibit compiler reordering */
 #if defined(_M_ARM) || defined(_M_ARM64)
    if (_Order != memory_order_relaxed)
        {
        _Memory_barrier();
        }

 #else
    _Compiler_barrier();
    if (_Order == memory_order_seq_cst)
        {   /* force visibility */
        static _Uint4_t _Guard;
        _Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst);
        _Compiler_barrier();
        }
 #endif
    }

So my major question is: how can _Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst); create a full barrier like MFENCE? What has actually been done to provide an MFENCE-equivalent mechanism, given that a _Compiler_barrier() is obviously not enough here for a full memory barrier? Or does this statement work somewhat like point 3?

calvin
  • About your point 3 "_std::memory_order_seq_cst makes no guarantee to prevent STORE-LOAD reorder_".. It does guarantee that, but only when both operations are tagged as such. – LWimsey Jan 18 '18 at 11:02
  • @LWimsey Do you mean if I use `atomic_store(memory_order_seq_cst )` and `atomic_load(memory_order_seq_cst )`, there'll be no reorder. However if I use `atomic_store(memory_order_release)` and `atomic_load(memory_order_acquire)`, then I should add a `MFENCE` to either of them, in order to avoid STORE-LOAD reorder? – calvin Jan 18 '18 at 13:45
  • Yes, if you use `seq_cst` on both the `store` and the `load`, all threads will observe both operations in that order. The same for inserting an `atomic_thread_fence(seq_cst)` in between (You can/should not really insert an `MFENCE`, leave that to the compiler). – LWimsey Jan 18 '18 at 14:07
  • @calvin It actually depends on whether or not you talk about the same memory location. If you do an `x.store(1, memory_order_release); x.load(memory_order_acquire);` then no fence would be needed (although such a construct would be highly questionable, so you probably meant them to be on different memory locations). – Carlo Wood Feb 16 '18 at 12:08
  • @LWimsey 1) All threads? Which threads? 2) Fence between what and what? Other threads must use the fence? – curiousguy Dec 10 '19 at 05:11
  • @LWimsey: you can use `atomic_thread_fence(seq_cst)` after some normal stores, before an atomic `.store(val, mo_relaxed)`, to effectively create a release store. Or after a store to make it more like a `.store(val, mo_seq_cst)`. ISO C++ doesn't define things in terms of reordering or not so I'm hesitant to say it stops it from reordering with later atomic loads and stores. For the first use-case to work, it does need to block compile-time reordering with non-atomic operations in some cases. – Peter Cordes Apr 21 '20 at 07:48
  • And yes, as an implementation detail it's allowed for it to be stronger, and actually block all reordering including non-atomic, at compile time. (And of course run-time with `mfence`) – Peter Cordes Apr 21 '20 at 07:49
  • @PeterCordes An `atomic_thread_fence(seq_cst)` _after_ a `.store(val, mo_relaxed)` does not make it a `.store(val, mo_seq_cst)` because it does not have `release` semantics and that's required (unless you take into account `x86` specs) – LWimsey Apr 21 '20 at 19:17
  • @LWimsey: I was trying not to fill up two 600 char comments saying that in detail, but "more like mo_seq_cst" was too vague. So yes, you could barrier before *and* after a .store(mo_relaxed) to get an inefficient seq_cst store. Or I guess just after a `release` store, at least on x86 if not in portable ISO C++. Perhaps not exactly equivalent on other ISAs that aren't multi-copy-atomic (notably POWER); I forget if POWER needs stronger barriers before a seq_cst store than before a release store. – Peter Cordes Apr 21 '20 at 19:28
  • @PeterCordes Barriers around a relaxed operation is an (inefficient) way to prevent reordering, that is to get acquire resp. release semantics; but it doesn't make the operation globally sequentially consistent. At most the fences are "sequentially consistent" (but then, you don't observe fences, only stores, directly, and loads, indirectly, via the side effects the code following them produce). – curiousguy May 03 '20 at 04:07

3 Answers

5

So my major question is: how can _Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst); create a full barrier like MFENCE?

This compiles to an xchg instruction with a memory destination. This is a full memory barrier (draining the store buffer), exactly like mfence (footnote 1).

With compiler barriers before and after that, compile-time reordering around it is also prevented. Therefore all reordering in either direction is prevented (of operations on atomic and non-atomic C++ objects), making it more than strong enough to do everything that ISO C++ atomic_thread_fence(mo_seq_cst) promises.


For orders weaker than seq_cst, only a compiler barrier is needed. x86's hardware memory-ordering model is program-order + a store buffer with store forwarding. That's strong enough for acq_rel without the compiler emitting any special asm instructions, just blocking compile-time reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/
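
As a rough illustration (my own sketch; the comments describe typical codegen, which varies by compiler and version and is not guaranteed):

#include <atomic>

void fences_on_x86() {
    // No effect at all.
    std::atomic_thread_fence(std::memory_order_relaxed);

    // Typically just a compiler barrier: no instruction is emitted, because
    // plain x86 loads and stores already have acquire/release semantics.
    std::atomic_thread_fence(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_acq_rel);

    // Needs a real full barrier: gcc/clang typically emit mfence here,
    // while MSVC emits the dummy xchg shown in the question.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}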


Footnote 1: exactly enough for the purposes of std::atomic. Weakly ordered MOVNTDQA loads from WC memory may not be as strictly ordered by locked instructions as by MFENCE.

Atomic read-modify-write (RMW) operations on x86 are only possible with a lock prefix, or with xchg with memory, which behaves that way even without a lock prefix in the machine code. A lock-prefixed instruction (or xchg with mem) is always a full memory barrier.

Using an instruction like lock add dword [esp], 0 as a substitute for mfence is a well-known technique. (And performs better on some CPUs.) This MSVC code is the same idea, but instead of a no-op on whatever the stack pointer is pointing-to, it does an xchg on a dummy variable. It doesn't actually matter where it is, but a cache line that's only ever accessed by the current core and is already hot in cache is the best choice for performance.

Using a static shared variable that all cores will contend for access to is the worst possible choice; this code is terrible! Interacting with the same cache line as other cores is not necessary to control the order of this core's operations on its own L1d cache. This is completely bonkers. MSVC still apparently uses this horrible code in its implementation of std::atomic_thread_fence(), even for x86-64 where mfence is guaranteed available. (Godbolt with MSVC 19.14)
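
For contrast, the same trick without the shared static would do the dummy RMW on something core-private, such as a stack local (a hypothetical sketch along those lines, not MSVC's actual code):

#include <atomic>

// Hypothetical helper: a dummy seq_cst RMW on a local variable.
// On x86 this compiles to an xchg (or another locked instruction), which is a
// full barrier, and the cache line it touches is private to the current core,
// so there is no cross-core contention like with the static _Guard above.
inline void full_barrier_via_dummy_rmw() {
    std::atomic<unsigned> dummy{0};
    dummy.exchange(0, std::memory_order_seq_cst);
}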

If you're doing a seq_cst store, your options are mov+mfence (gcc does this) or doing the store and the barrier with a single xchg (clang and MSVC do this, so the codegen is fine, no shared dummy var).
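
In C++ terms (my sketch; the asm in the comments is the typical mapping just described, not a guarantee):

#include <atomic>

std::atomic<int> g{0};

void seq_cst_store(int v) {
    g.store(v, std::memory_order_seq_cst);
    // gcc:          mov to [g], then mfence
    // clang / MSVC: a single xchg with [g]  (store + full barrier in one instruction)
}

void release_store(int v) {
    g.store(v, std::memory_order_release);
    // all of them: a plain mov; ordinary x86 stores are already release stores
}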


Much of the early part of this question (stating "facts") seems wrong and contains some misinterpretations or things that are so misguided they're not even wrong.

std::memory_order_seq_cst makes no guarantee to prevent STORE-LOAD reorder.

C++ guarantees order using a totally different model, where acquire loads that see a value from a release store "synchronize with" it, and later operations in the C++ source are guaranteed to see all the stores from code before the release store.
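
For example, the standard message-passing pattern (my sketch) relies on exactly that synchronizes-with guarantee rather than on reasoning about reordering:

#include <atomic>

int payload;                          // ordinary non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release);     // release store
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // acquire load
    // The acquire load saw the value written by the release store, so the
    // store synchronizes-with the load and the write of payload is visible:
    int r = payload;   // guaranteed to read 42, no data race
    (void)r;
}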

It also guarantees that there's a total order of all seq_cst operations even across different objects. (Weaker orders allow a thread to reload its own stores before they become globally visible, i.e. store forwarding. That's why only seq_cst has to drain the store buffer. They also allow IRIW reordering. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
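
The IRIW litmus test referenced there looks like this (sketch):

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer1() { x.store(1, std::memory_order_seq_cst); }
void writer2() { y.store(1, std::memory_order_seq_cst); }

void reader1() {
    r1 = x.load(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}

void reader2() {
    r3 = y.load(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_seq_cst);
}

// With seq_cst everywhere, the two readers must agree on the order of the two
// independent writes, so r1==1, r2==0, r3==1, r4==0 is forbidden. With only
// acquire loads and release stores, that outcome is allowed on hardware that
// is not multi-copy-atomic (e.g. POWER): IRIW reordering.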

Concepts like StoreLoad reordering are based on a model where:

  • All inter-core communication is via committing stores to cache-coherent shared memory
  • Reordering happens inside one core between its own accesses to cache. e.g. by the store buffer delaying store visibility until after later loads like x86 allows. (Except a core can see its own stores early via store forwarding.)

In terms of this model, seq_cst does require draining the store buffer at some point between a seq_cst store and a later seq_cst load. The efficient way to implement this is to put a full barrier after seq_cst stores. (Instead of before every seq_cst load. Cheap loads are more important than cheap stores.)

On an ISA like AArch64, there are load-acquire and store-release instructions which actually have sequential-release semantics, unlike x86 loads/stores which are "only" regular release. (So AArch64 seq_cst doesn't need a separate barrier; a microarchitecture could delay draining the store buffer unless / until a load-acquire executes while there's still a store-release not committed to L1d cache yet.) Other ISAs generally need a full barrier instruction to drain the store buffer after a seq_cst store.

Of course even AArch64 needs a full barrier instruction for a seq_cst fence, unlike a seq_cst load or store operation.


std::atomic_thread_fence(memory_order_seq_cst) always generates a full-barrier

In practice yes.

So I can always replace asm volatile("mfence" ::: "memory") with std::atomic_thread_fence(memory_order_seq_cst)

In practice yes, but in theory an implementation could maybe allow some reordering of non-atomic operations around std::atomic_thread_fence and still be standards-compliant. Always is a very strong word.

ISO C++ only guarantees anything when there are std::atomic load or store operations involved. GNU C++ would let you roll your own atomic operations out of asm("" ::: "memory") compiler barriers (acq_rel) and asm("mfence" ::: "memory") full barriers. Converting that to ISO C++ signal_fence and thread_fence would leave a "portable" ISO C++ program that has data-race UB and thus no guarantee of anything.

(Although note that rolling your own atomics should use at least volatile, not just barriers, to make sure the compiler doesn't invent multiple loads, even if you avoid the obvious problem of having loads hoisted out of a loop. Who's afraid of a big bad optimizing compiler?).
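
A sketch of the kind of legacy GNU C++ pattern being described (function and variable names are made up; shown only to illustrate the point about volatile plus barriers, use std::atomic in real code):

// Pre-C++11 style, GNU C/C++ only. volatile keeps the compiler from inventing,
// merging, or hoisting the accesses; the asm statements are the barriers.
static volatile int shared_flag;

static inline void compiler_barrier() { asm volatile("" ::: "memory"); }
static inline void full_barrier()     { asm volatile("mfence" ::: "memory"); }

void legacy_publish() {
    // ... write shared data here ...
    compiler_barrier();   // release-ish: earlier stores stay before the flag store
    shared_flag = 1;
    full_barrier();       // drain the store buffer, roughly like a seq_cst store
}

int legacy_consume() {
    int seen = shared_flag;   // volatile read: exactly one load, never invented twice
    compiler_barrier();       // acquire-ish: later loads stay after this load
    return seen;
}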


Always remember that what an implementation does has to be at least as strong as what ISO C++ guarantees. That often ends up being stronger.

Peter Cordes
  • Any reason to prefer exactly `XCHG` and exactly static storage duration variable (which is actually not stored on stack)? I suspect it is due to having smallest encoding (no `LOCK` prefix, simple address obtaining). And so change to `LOCK CMPXCHG` in later MSVC is not useful, though not very harmful. – Alex Guteniev Apr 23 '20 at 09:02
  • I know they also don't use `mfence` because they run on CPUs without `mfence` (at least used to run in theory), so even if `mfence` had better performance, they probably would not use it – Alex Guteniev Apr 23 '20 at 09:07
  • @AlexGuteniev: oh holy crap, I only skimmed the code and only saw what I expected to see. Not the insane code that's actually there (`static`), which is vastly worse and creates contention between separate cores doing a barrier. `mfence` is baseline for x86-64 so that would be an option, and smaller code than having to zero a reg for `xchg`. You might still choose a dummy `xchg` or other locked operation as a more efficient barrier for some CPUs. But a dummy xchg with a *stack* variable would be totally fine, perhaps making it `volatile` if necessary to stop it from optimizing away. – Peter Cordes Apr 23 '20 at 09:17
  • xor-zeroing EAX + `xchg [rsp+disp8]` is I think equal size to `lock add [rsp+disp8], 0`. (2-byte xor-zeroing vs. 1 byte lock + one byte for the imm8=0 on top of the same opcode + modrm + disp8 for xchg or add) – Peter Cordes Apr 23 '20 at 09:19
  • I thought if you increment a dummy stack atomic by 1, and would discard returned value, compilers would be able to use `lock inc`, but apparently they are not. – Alex Guteniev Apr 23 '20 at 09:55
  • @AlexGuteniev: GCC uses `lock inc` for atomic `++` when you compile with a tune option that tunes for a CPU where `inc` is totally fine on registers. e.g. `-mtune=haswell`. https://godbolt.org/z/fWMhMB. Unfortunately it uses memory-destination `inc` even in the non-locked case, which costs an extra uop on Haswell. [INC instruction vs ADD 1: Does it matter?](https://stackoverflow.com/q/36510095). – Peter Cordes Apr 23 '20 at 10:01
  • @AlexGuteniev: MSVC uses `xadd` with a value in EAX, even when you aren't using the return value of `var++` or `var += 0;`. So that's dumb. At least the MSVC under WINE install on Godbolt which is all that's available today (https://github.com/mattgodbolt/compiler-explorer/issues/1929). Presumably it's just as bad in current MSVC on real Windows, missing that peephole optimization. – Peter Cordes Apr 23 '20 at 10:03
  • I've reported the code being sub-optimal here: https://github.com/microsoft/STL/issues/739 , further discussion of better code for MSVC would be more useful there – Alex Guteniev Apr 23 '20 at 11:58
  • I brought this to the attention of Boost.Atomic maintainer, today he made [a commit](https://github.com/boostorg/atomic/commit/559eba81af71386cedd99f170dc6101c6ad7bf22). The most interesting part is an explanation **against** `lock or [esp], 0` in this link: https://shipilev.net/blog/2014/on-the-fence-with-dependencies/ . In short, there may be false data dependency in case of registers spilling. – Alex Guteniev Jun 11 '20 at 20:14
  • @AlexGuteniev: Oh cool, using the dword or qword below the current stack point is something I suggested for GCC (https://gcc.gnu.org/legacy-ml/gcc-patches/2016-05/msg02289.html), but they reverted the change after finding that some analysis tools like valgrind complained. I was curious what the real perf impact was, since in theory it's a full barrier and ends with the store buffer drained. But in practice the CPU can probably still actually start loading sooner. – Peter Cordes Jun 11 '20 at 20:20
  • I'm not sure I fully understand what happens in linked example either. I also now noticed that they have 4-byte dummy increment (lock **addl** $0x0,(%rsp)), but 8-byte load mov (%rsp),**%rcx**. Does the problem happen on 8-byte dummy `or` ? – Alex Guteniev Jun 12 '20 at 03:09
  • @AlexGuteniev: `lock add` vs. `lock or` is totally irrelevant. I doubt that the operand-size matters either; by definition atomic RMWs are a full barrier so I assume they can't store-forward to a later load. Non-atomic `orl $0, (%rsp)` *would* cause a store-forwarding stall if an 8-byte reloaded `(%rsp)` right away, but I assume the data just has to come from L1d cache after a `lock`ed operation that overlaps any bytes with the load. So fully overlapping with one qword local var vs. partially overlapping the low dword doesn't matter. – Peter Cordes Jun 12 '20 at 03:14
  • @AlexGuteniev: However, narrower means you only overlap *one* dword, not two. `lock orb $0, (%rsp)` might be better in that respect, in case you happen to only cause a penalty for `char_array[0]` instead of `char_array[0..3]`. Of course qword operand size is a waste of a REX prefix and should be avoided for code-size reasons. Oh also, the read side of an atomic RMW is maybe speculatively executed early. *It* might benefit from store-forwarding, so narrower operand size makes that more compatible with more types that might be locals. Byte being best, unless there's any downside to it. – Peter Cordes Jun 12 '20 at 03:15
  • I meant the case which breaks store-forwarding by loading 8 bytes after storing 4 bytes. As you noted, with atomic it should not be relevant, but if the information on given link is correct, then I suspect it might be somehow relevant. As I said, I don't understand (and don't care that much to try to conduct own experiments/investigations right now) – Alex Guteniev Jun 12 '20 at 03:43
  • I'm also wondering if `_InterlockedOr(reinterpret_cast<volatile long*>(_AddressOfReturnAddress()), 0);` is vulnerable to the issue, but it is more a theory question, I'm fine with extra variable. – Alex Guteniev Jun 12 '20 at 04:02
  • @AlexGuteniev: TL:DR: `lock orq $0, (%rsp)` is the worst option. It has at least a code-size downside, and maybe performance, vs. `lock orb $0, (%rsp)` or `lock orl`. I'd expect that `lock orb` is best, but `lock orb $0, -1(%rsp)` might be even better. Or since space below RSP is dead on Windows x64 (no red zone), `lock orb %al, -1(%rsp)` is possible to save a byte, but with a false dependency on RAX. Doing a locked op on the return address could work; branch prediction hides the latency of actually reloading it. – Peter Cordes Jun 12 '20 at 04:02
  • I think it deserves separate question and answer, instead of being in comments: https://stackoverflow.com/q/62337376/2945027 – Alex Guteniev Jun 12 '20 at 04:12
  • @PeterCordes Armv8.3+ supports reordering of _independent_ store-release and load-acquire (`LDAPRB`, `LDAPRH`, `LDAPR`), also [reference](https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/armv8-a-architecture-2016-additions#Memory_consistency_model) (see Memory Consistency Model); Why does the above only satisfy acq_rel, and not seq_cst guarantees? i.e. we get release/acquire semantics, plus total order of stores, even though independent store/load may not be visible in source code order. – Daniel Nitzan Jan 27 '21 at 09:08
  • ^^ Just noticed that [the ref manual](https://developer.arm.com/documentation/ddi0487/latest/) section C6.2.102 on LDAPR says: _The reading of a value written by a Store-Release by a Load-AcquirePC instruction by the same observer does not make the write of the Store-Release globally observed_; So STLR/LDAPR pair is plain old store-rel/read-acq without ordering guarantees; The store buffer is not drained after the store-rel or before the read-acq, not even for dependent load/store. If it was drained, then I guess it would satisfy seq_cst. – Daniel Nitzan Jan 27 '21 at 16:49
  • @DanielNitzan: Yeah, it's not seq_cst for exactly the same reason x86 needs `mfence` or a dummy `lock`ed operation as a full-barrier after seq_cst stores. x86's standard memory model makes all loads + stores acq / rel, with a total store order. It seems the point of v8.3 LDAPR and friends is to get acquire without draining the store buffer (seq_cst), for higher performance when used with release / seq_cst stores to other objects. Including allowing store-forwarding from an STLR. So it seems STLR / LDAPR is exactly like x86's standard memory model. – Peter Cordes Jan 27 '21 at 18:16
  • @PeterCordes ARM has gone so far as to provide a cheap seq_cst based on stlr/ldar, by having ldar inspect the store buffer for any stlr and flush the buffer if any exists. I'm wondering why they haven't gone the extra mile to make it even cheaper by having ldar look for stlr to the same address, and only if such stlr exists, then flush the store buffer. Obviously this complicates the circuitry, but it would be exactly seq_cst, no more no less. – Daniel Nitzan Jan 27 '21 at 19:04
  • @DanielNitzan: That would allow StoreLoad reordering and not be seq_cst. https://preshing.com/20120515/memory-reordering-caught-in-the-act/. I'm pretty sure ISO C++ `seq_cst` is strong enough to disallow the reordering in Preshing's example, by dint of requiring a total order for *all* seq_cst store and load operations. I think ARM's `stlr` / `ldar` already is about (or exactly?) as weak as you can be and still be seq_cst. – Peter Cordes Jan 27 '21 at 19:18
  • @PeterCordes it's probably my misunderstanding, but Preshing's caught in the act doesn't in and of itself contradict C++11's seq_cst, which says nothing about StoreLoad reordering. It only requires read-acq/store-rel + TSO? – Daniel Nitzan Jan 27 '21 at 19:32
  • @DanielNitzan: On x86 (read-acq / store-rel + TSO), you *do* see memory reordering. On a sequentially-consistent machine, you wouldn't; that's the literal definition of *sequential* consistency (everything appears to have happened in an order compatible with some interleaving of program order). By using full barriers on x86, we can recover seq_cst (demonstrating the point of barriers). You would need to invent some alternate way this litmus test could guarantee no reordering, but without flushing the store buffer between a seq_cst store and seq_cst load, for your idea to work. – Peter Cordes Jan 27 '21 at 19:45
  • @PeterCordes right, frankly ISO C++ definition of seq_cst confused me. I just couldn't find where the formal definition states that all operations need to take place in program order. I will try to dissect it more thoroughly just for fun – Daniel Nitzan Jan 27 '21 at 22:40
  • @DanielNitzan: See https://eel.is/c++draft/intro.multithread#intro.races-note-21 for a statement that the C++ rules *do* add up to giving that guarantee. (For data-race-free programs). See also https://eel.is/c++draft/atomics.order#note-5, and most importantly the preceding rule about the existence of a single total order on *all* seq_cst operations, and rules about that order being compatible with other orders like some forms of happens-before, including sequenced-before (program-order). – Peter Cordes Jan 27 '21 at 23:22
  • It's a bit of a mess to actually fully follow, but it's well known that C++ provides SC-DRF: sequential consistency (following the standard definition of that term which far precedes C++11), for Data Race Free programs. [Memory Model in C++ : sequential consistency and atomicity](https://stackoverflow.com/q/38425920) – Peter Cordes Jan 27 '21 at 23:23
  • @PeterCordes thanks for the pointers; I think that _strongly happens-before_ is used to stress the correct sequential order, as in _if A and B are memory_­order​::​seq_­cst operations and A strongly happens before B, then A precedes B in S_ – Daniel Nitzan Jan 28 '21 at 06:28
  • This was brought up in SOF before, and also helps: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html, especially the discussion about _sequenced before_ / _strongly happens before_ – Daniel Nitzan Jan 28 '21 at 06:33
  • @DanielNitzan: Thanks for the link, I hadn't realized there were ISAs where `release` stores had problems synchronizing with `seq_cst` loads (although I guess only indirectly). Yuck, I had assumed that seq_cst loads could safely be assumed to be a stronger version of acquire, like they are on most ISAs. On x86, using release for pure stores is a win, but there's no motivation to weaken loads until you care about performance on other ISAs. Except now you need to match acquire with release to even get correctness, at least in cases of transitive relations... Will have to look at that more. – Peter Cordes Jan 28 '21 at 07:22
  • @PeterCordes yeah, also the example program given to demonstrate broken compilation when leading fences are used, totally blows my mind. It just amazes me how scientists come to think of such complex scenarios given they rarely (never?) occur in practice. – Daniel Nitzan Jan 29 '21 at 12:36
2

It sounds like the x86 implementation of the atomic STORE/LOAD operations takes advantage of the strongly ordered asm-level memory model of the x86 architecture. See also C/C++11 mappings to processors

The situation is very different on ARM, which the code snippet in the question demonstrates.

Herb Sutter made a great presentation on this for CPPCON 2014: https://www.youtube.com/watch?v=c1gO9aB9nbs

Yuri Beard
  • @mohammedwazeem - don't use `code formatting` for highlighting. See [What could be done to stop code formatting misuse?](https://meta.stackexchange.com/q/172458). Your edit is harmful, and the people that approved it should have rejected it. – Peter Cordes Apr 23 '20 at 09:49
  • @PeterCordes Got it. Thanks for your valuable information. – mohammed wazeem Apr 23 '20 at 10:43
0

Just because a C++ fence is implemented as a particular assembly-level fence, and in general needs to produce one, does not mean that you can go hunting for inline asm and replace explicit asm fences with C++ instructions!

C++ thread fences are called std::atomic_thread_fence for a reason: they have a defined function solely in relation with std::atomic<> objects.

You absolutely can't use these to order normal (non-atomic) memory operations.

std::memory_order_seq_cst makes no guarantee to prevent STORE-LOAD reorder.

It does but only with respect to other std::memory_order_seq_cst operations.

curiousguy
  • But how is an exchange synchronized with other atomics, in particular with a `std::memory_order_seq_cst` load of another atomic (which is a plain load in MSVC)? And I see it was replaced with `InterlockedCompareExchange`, so didn't exchange work? – Alex Guteniev Apr 16 '20 at 07:10
  • "_But how an exchange is synchronized with other atomics_" Specifically: an exchange done how? – curiousguy Apr 16 '20 at 14:59
  • I mean, this `atomic_thread_fence(seq_cst)` is implemented via exchange, and `atomic::load(seq_cst)` is implemented via a simple load with only compiler barriers around it. How are they synchronized? – Alex Guteniev Apr 17 '20 at 04:48