Don't use this code in practice; use `std::atomic<bool>` with `memory_order_release` and `acquire` to get the same asm code-gen (but without the unnecessary `lfence` and `sfence`).
But yes, this looks safe, for compilers that define the behaviour of `volatile` such that data-race UB on the `volatile bool` flag isn't a problem. This is the case for compilers like GCC that can compile the Linux kernel (which rolls its own atomics using `volatile` like you're doing).
ISO C++ doesn't strictly require this; for example, a hypothetical implementation might exist on a machine without coherent shared memory, so atomic stores would require explicit flushing. But in practice there aren't any such implementations. (There are some embedded systems where `volatile` stores use different or extra instructions to make MMIO work, though.)
A barrier before a store makes it a release store, and a barrier after a load makes it an acquire load. https://preshing.com/20120913/acquire-and-release-semantics/. Happens-before can be established with just a release store seen by an acquire load.
The x86 asm memory model already forbids all reordering except StoreLoad, so only compile-time reordering needs to be blocked. This will compile to asm that's the same as what you'd get from using `std::atomic<bool>` with `mo_release` and `mo_acquire`, except for those inefficient LFENCE and SFENCE instructions.
C++ How is release-and-acquire achieved on x86 only using MOV? explains why the x86 asm memory model is at least as strong as acq_rel.
The `sfence` and `lfence` instructions inside the asm statements are totally irrelevant; only the `asm("" ::: "memory")` compiler-barrier part is needed. https://preshing.com/20120625/memory-ordering-at-compile-time/. Compile-time reordering only has to respect the C++ memory model, but whatever the compiler picks is then nailed down by the x86 memory model. (Program order + store buffer with store-forwarding = slightly stronger than acq_rel.)
(A GNU C `asm` statement with no output operands is implicitly volatile, so I'm omitting the explicit `volatile`.)
(Unless you're trying to synchronize NT stores? If so you only need `sfence`, not `lfence`.)
Does the Intel Memory Model make SFENCE and LFENCE redundant? Yes. A `memset` that internally uses NT stores will use `sfence` itself, to make itself compatible with the standard C++ atomics / ordering -> asm mapping used on x86. If you use a different mapping (like freely using NT stores without `sfence`), you could in theory break mutex critical sections unless you roll your own mutexes, too. (In practice most mutex implementations use a `lock`ed instruction in take and release, which is a full barrier.)
An empty asm statement with a memory clobber is sort of a roll-your-own equivalent to `atomic_thread_fence(std::memory_order_acq_rel)` because of x86's memory model. `atomic_thread_fence(acq_rel)` will compile to zero asm instructions, just blocking compile-time reordering.
Only a seq_cst thread fence needs to emit any asm instructions to flush the store buffer and wait for that to happen before any later loads, aka a full barrier (like `mfence` or a `lock`ed instruction like `lock add qword ptr [rsp], 0`).
Don't roll your own atomics using `volatile` and inline asm
Yes, you can, and I hope you were just asking to understand how things work.
You ended up making something much less efficient than it needed to be because you used `lfence` (an out-of-order execution barrier that's essentially useless for memory ordering) instead of just a compiler barrier. And an unnecessary `sfence`.
See When should I use _mm_sfence, _mm_lfence, and _mm_mfence for basically the same problem but using intrinsics instead of inline asm. Generally you only want `_mm_sfence()` after NT-store intrinsics, and you should leave `mfence` up to the compiler with `std::atomic`.
When to use volatile with multi threading? - normally never; use `std::atomic` with `mo_relaxed` instead of `volatile`.