
Briefly speaking, can the data stored in src be correctly copied to dst, in the following code?

volatile bool flag = false;

// In thread A.
memcpy(mid, src, size);
__asm__ __volatile__("sfence" ::: "memory");
flag = true;

// In thread B.
while (flag == false);
__asm__ __volatile__("lfence" ::: "memory");
memcpy(dst, mid, size);
Peter Cordes
Yuxai
  • I'm tempted to remove the C++ tag, as this question seems to be entirely about x86 assembly. Instead I've added the x86 tag and left it to you to decide if you feel the C++ tag is appropriate or not. – John Zwinck Mar 13 '20 at 11:37
    @JohnZwinck: "happens before" is a C++ concept; the question doesn't make sense without that tag. But yes, the x86 tag is 100% required for this, too. – Peter Cordes Mar 13 '20 at 11:38

2 Answers


https://gcc.gnu.org/wiki/DontUseInlineAsm

Don't use this code in practice; use std::atomic<bool> with memory_order_release and acquire to get the same asm code-gen (but without the unnecessary lfence and sfence).


But yes, this looks safe, for compilers that define the behaviour of volatile such that data-race UB on the volatile bool flag isn't a problem. This is the case for compilers like GCC that can compile the Linux kernel (which rolls its own atomics using volatile like you're doing).

ISO C++ doesn't strictly require this, for example a hypothetical implementation might exist on a machine without coherent shared memory, so atomic stores would require explicit flushing. But in practice there aren't any such implementations. (There are some embedded systems where volatile stores use different or extra instructions to make MMIO work, though.)


A barrier before a store makes it a release store, and a barrier after a load makes it an acquire load. https://preshing.com/20120913/acquire-and-release-semantics/. Happens Before can be established with just a release store seen by an acquire load.

The x86 asm memory model already forbids all reordering except StoreLoad, so only compile-time reordering needs to be blocked. This will compile to the same asm as you'd get from using std::atomic<bool> with mo_release and mo_acquire, except for those inefficient LFENCE and SFENCE instructions.

C++ How is release-and-acquire achieved on x86 only using MOV? explains why the x86 asm memory model is at least as strong as acq_rel.


The sfence and lfence instructions inside the asm statements are totally irrelevant, only the asm("" ::: "memory") compiler barrier part is needed. https://preshing.com/20120625/memory-ordering-at-compile-time/. Compile-time reordering only has to respect the C++ memory model, but whatever the compiler picks is then nailed down by the x86 memory model. (Program-order + store buffer with store forwarding = slightly stronger than acq_rel)

(A GNU C asm statement with no output operands is implicitly volatile so I'm omitting the explicit volatile.)

(Unless you're trying to synchronize NT stores? If so you only need sfence, not lfence. Does the Intel Memory Model make SFENCE and LFENCE redundant? Yes.) A memcpy or memset implementation that internally uses NT stores will use sfence itself, to make itself compatible with the standard C++ atomics / ordering -> asm mapping used on x86. If you use a different mapping (like freely using NT stores without sfence), you could in theory break mutex critical sections unless you roll your own mutexes, too. (In practice most mutex implementations use a locked instruction in take and release, which is a full barrier.)

An empty asm statement with a memory clobber is sort of a roll-your-own equivalent of atomic_thread_fence(std::memory_order_acq_rel) because of x86's memory model. atomic_thread_fence(acq_rel) will compile to zero asm instructions, just blocking compile-time reordering.

Only a seq_cst thread fence needs to emit any asm instructions, to flush the store buffer and wait for that to happen before any later loads: aka a full barrier (like mfence, or a locked instruction such as lock add qword ptr [rsp], 0).


Don't roll your own atomics using volatile and inline asm

Yes, you can, and I hope you were just asking to understand how things work.

You ended up making something much less efficient than it needed to be because you used lfence (an out-of-order execution barrier that's essentially useless for memory ordering) instead of just a compiler barrier. And an unnecessary sfence.

See When should I use _mm_sfence _mm_lfence and _mm_mfence for basically the same problem but using intrinsics instead of inline asm. Generally you only want _mm_sfence() after NT-store intrinsics, and you should leave mfence up to the compiler with std::atomic.

When to use volatile with multi threading? - normally never; use std::atomic with mo_relaxed instead of volatile.

Peter Cordes
  • It's more like the CPU defining data races as well defined, and volatile scalar operations bringing that guarantee to the C++ programmer, rather than the compiler defining anything re: volatile. Volatile means "whatever the ABI and CPU guarantee for that type". – curiousguy Apr 08 '20 at 04:50

If you're asking about C++ memory model then the answer is no, your code is not thread-safe for multiple reasons:

  1. In the C++ memory model, concurrent access to an object from multiple threads, where at least one access is a modification, constitutes a data race, which is UB. The only exceptions to this rule are thread synchronization primitives, like atomics, mutexes, condition variables, etc. Volatile variables are not an exception.
  2. Accesses to the variable flag are not required to be atomic, even if it is marked as volatile. This means that thread B may observe a value that was not stored in flag, including a trap representation (i.e. a representation that corresponds to no valid value of bool). Using such a value may produce undefined behavior, for example, the observed value of flag may not be equal to either true or false.
  3. Writing or reading a volatile variable does not establish a "happens-before" relation. In other words, a volatile access is not a compiler or hardware fence, so the surrounding code may be reordered around the volatile reads or writes by either the compiler or the CPU. Your attempt to introduce a fence with the asm block is not portable.

Practically speaking, your code may produce a sequence of x86 instructions that will behave as you would expect. This would be a pure coincidence given that:

  1. sizeof(bool) == 1 on x86 on pretty much every OS, and storing and loading bytes is atomic on x86. Note that there are platforms where sizeof(bool) > 1 and thus accessing it may not be atomic.
  2. On x86, regular stores and loads are ordered. In other words, a later store cannot be reordered before an earlier one by the CPU; same with loads. Many other CPU architectures are not so strict.
  3. Most compilers will avoid reordering code around volatile operations. Some compilers, like MSVC, for example, even consider volatile operations as compiler fences. That is not the case with gcc though. Luckily, the __volatile__ qualifier prevents the compiler from reordering the asm block (and the fence it implements) with the surrounding code. This will make the asm blocks with hardware fences effective with that compiler and compatible ones.

But I'll repeat, if the code works, then only by coincidence. It doesn't have to, even on x86, as the compiler is free to optimize this code as it wants to, since, as far as it is concerned, no thread concurrency is involved here. You may rely on guarantees provided by the specific compiler, such as non-standard semantics of volatile, intrinsics and asm blocks, but at that point your program is not portable C/C++ and is written for the specific compiler, possibly with a specific set of command line switches.

Andrey Semashev
  • which platforms have sizeof(bool) > 1? – SPD Mar 13 '20 at 17:35
  • @SPD For example, Sun Solaris on SPARC (SunPro C/C++ compiler) had `sizeof(bool) == 4`. I'm not sure how it is now, in the Oracle era, as I don't use this platform anymore. I imagine it should still maintain the same size for ABI compatibility. – Andrey Semashev Mar 13 '20 at 18:28
  • It's more than "purely a coincidence", it's behaviour that the Linux kernel relies on and thus is well supported by compilers that support GNU C inline asm. The key to ordering is the `"memory"` clobber in the asm statement, not just it being `volatile` - volatile alone wouldn't order it wrt. the memcpy, only wrt. the `volatile bool` access. – Peter Cordes Mar 13 '20 at 21:41
  • This answer is correct from a language-lawyer and portability POV, but volatile + `asm()` was how people rolled their own atomics before C++11, and is de-facto supported even though ISO C++ guarantees nothing. Non-atomicity on `volatile bool` is also highly unlikely. In practice systems use integer `0` and `1` bit patterns that differ only in a single byte (not 0 / -1), so tearing couldn't produce a different value even for a multi-byte `bool`. (ping @SPD). Also, SPARC has atomic loads/stores of aligned 4-byte words so that's not actually a problem. (BTW, I updated my answer some.) – Peter Cordes Mar 13 '20 at 21:44
  • @PeterCordes Linux kernel is not written in C++. It is not in pure C either. It is a mixture of asm and a subset of C *as implemented by a single compiler with specific switches* (though recently support for a second compiler was added, which required great effort on both sides). So clearly Linux kernel is not a beacon of portability, though there are reasons why it is written this way. If you're asking "can a particular compiler be persuaded to generate the desired instructions" then the answer is yes, but that is not guaranteed by language, which means coincidence. – Andrey Semashev Mar 14 '20 at 08:33
  • BTW, `volatile` was never a way to implement atomics, not even before C++11. It was inline assembler and compiler intrinsics. It still is, for anyone wanting to implement atomics instead of using `std::atomic`. – Andrey Semashev Mar 14 '20 at 08:39
  • > a subset of C -- That should say "a subset of C with non-standard extensions". – Andrey Semashev Mar 14 '20 at 08:41
  • C vs. C++ is not really relevant; ISO C and C++ say basically the same thing about `volatile`, and more importantly real implementations that can compile the OP's code (gcc/clang/ICC and maybe Sun's compiler) all compile it in a well-defined way, especially when we're just talking about x86. That's why I used Linux as an example. – Peter Cordes Mar 14 '20 at 08:53
  • `volatile` was and is a key part of hand-rolled atomics (that don't use compiler intrinsics like `__atomic_load_n`). Note that [GCC's legacy `__sync` builtins](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html) didn't even have a plain load, only RMWs. You need to make sure the compiler does a load exactly once: not zero times, but also not assuming it can reload the same value later and lead to a variable that's both true and false. I think you can use `asm("":::"memory")` before and after an `int tmp = shared` load instead of making `shared` volatile, though, if you want. – Peter Cordes Mar 14 '20 at 08:58
  • https://lwn.net/Articles/793253/#Invented%20Loads points out this possible problem, and that using `volatile` can also prevent performance problems from having a compiler hoist a load out of an `if` leading to false sharing. Anyway, `volatile` might be avoidable, but back in the day (without GNU C/C++ `__atomic` builtins) it's a much better solution than putting compiler memory barriers before/after every atomic load. Anyway, of course this is all "on a particular compiler", or a *set* of compilers. Specifically ones that support GNU C inline asm syntax which the OP is using, and target x86. – Peter Cordes Mar 14 '20 at 09:03
  • > all compile it in a well-defined way -- Defined by what, exactly? C/C++ guarantees neither atomicity nor ordering. Particular compilers may guarantee that, but at this point you're writing code for a particular compiler, not C or C++. Obviously, when you're using asm and intrinsics (which naturally provide compiler-specific guarantees) and rely on non-standard `volatile` semantics, you're also not writing in pure C/C++ and target a specific compiler. That is the point I'm trying to make. The OP asked about C++, and in C++ his code is not thread-safe. – Andrey Semashev Mar 14 '20 at 09:07
  • Yes, exactly, you're coding for the GNU dialect of C/C++, not portable ISO C++. I think it's obvious the OP knows that, given that they're using GNU C inline asm. Of course a DeathStation 9000 x86 C++ could break your code by having actual problematic UB from tearing of `volatile` by using separate byte loads/stores to access a multi-byte `bool`. – Peter Cordes Mar 14 '20 at 09:13
  • And of course you shouldn't write code like the OP did; the largest heading size in my answer says that and links to a whole answer explaining that you can get the same behaviour 100% portable with std::atomic with less than seq_cst memory order, if what you really want is to avoid the `mfence` that's part of a seq_cst store on x86. Pointing out that this code depends on the semantics of the `asm` statements hardly seems useful; of course it does and that's the whole point of the question. – Peter Cordes Mar 14 '20 at 09:14
  • I just noticed that your answer goes as far as actually being wrong: *the compiler is allowed to move the asm block around the assignment or reading of `flag`*. Nope, that's what the `"memory"` clobber is for. If you wanted to say that it's only portable to compilers that understand GNU C inline asm syntax, that would be fine. But you appear to be claiming there might be compilers that would parse this code and generate unsafe asm. Any reasonable person would consider that a buggy implementation of that GNU C extension, and AFAIK no such compiler exists in real life. – Peter Cordes Mar 14 '20 at 09:17
  • The "memory" clobber says the asm block may affect data state in memory (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers). Which is what makes it a compiler fence, as the compiler cannot assume any cached values are valid past such an asm statement. The "memory" clobber does not prevent the asm block from being moved, though. For example, it can be moved outside a loop body. The `__volatile__` qualifier is what prevents that. – Andrey Semashev Mar 14 '20 at 10:08
  • I've updated my answer to better explain the `__volatile__` effect and make my point re portability more clear. It should also address the confusion about reordering the asm block with other code. – Andrey Semashev Mar 14 '20 at 10:22
  • I *think* the only possibility without volatile (explicit, or implicit from having no output operands) is just that the asm optimizes away; the memory clobber is basically like a black-box function call so I don't think just moving it wrt. accesses to escaped or non-local vars is legal for GNU C. https://godbolt.org/z/ndyc-4 shows that gcc at least doesn't do that in one case where it would allow dead-store elimination for a non-volatile `int` global. With just volatile but not a memory clobber, it would reorder. BTW, I didn't get a notification on your replies because you didn't @ me. – Peter Cordes Mar 14 '20 at 10:54
  • I still disagree with *But I'll repeat, if the code works, then only by coincidence.* If you consider compiling it with a compiler that supports the language extension you're using a coincidence, then sure, but that seems like deliberately misunderstanding the question as asking about pure ISO C++. (Or maybe you're right and the OP doesn't know `asm("":::"memory")` isn't portable?) The phrasing you're using to make your point about GNU C inline asm being non-portable is too strong, to the point of being disingenuous. This isn't even tagged "language-lawyer". – Peter Cordes Mar 14 '20 at 10:59
  • *2. On x86, regular stores and loads are ordered.* Yes, but a barrier instruction after a plain load, or before a plain store, can still create an acquire or release operation. The name of the instruction depends on the ISA. For x86, the empty string is fine. For PowerPC, `lwsync`. The OP happened to pick valid but useless x86 instructions which sound like they'd be the right things so they have the right concept, wrong x86 details. – Peter Cordes Mar 14 '20 at 11:02
  • @PeterCordes I read the question as a question primarily about C++ memory model. That the OP used gcc asm syntax does not necessarily indicate he's targeting that specific compiler as the code sample seems to be used simply to illustrate the idea. That said, in my answer i covered not only pure C++ language aspects, but also mentioned compiler extensions that may make this code work. I think, my answer is correct in this regard. – Andrey Semashev Mar 14 '20 at 11:50
  • Regarding the "coincidence" wording, I speak from a standpoint where a C++ program is portable by default. That a C++ program is compiled by a particular compiler is a coincidence to me. That a code that relies on UB works as the programmer intended is a coincidence as well. Of course, if a programmer intends to compile the program with one compiler only, which provides additional guarantees compared to C++ standard, that is not a coincidence to him. – Andrey Semashev Mar 14 '20 at 11:56
  • Extensions to C++ aren't what I'd describe as UB. At least not when they use new syntax so they only compile in the first place on implementations that define the behaviour. Sure it's not portable, but that's different from something like union type-punning in C++ which compiles everywhere. – Peter Cordes Mar 14 '20 at 14:56
  • @PeterCordes Accessing `volatile` from multiple threads does not introduce a new syntax and is UB. – Andrey Semashev Mar 14 '20 at 15:27
  • Ah yes, that's fair. Like I said in my answer, correctness of that relies on the target machine defining the behaviour (i.e. coherent shared memory) because the C++ abstract machine doesn't define it beyond "must not happen as written". A DeathStation 9000 on x86 could introduce tearing of `volatile`, and use an ABI where `bool` is 65-byte 0 / -1 so it necessarily spans 2 cache lines, and tearing is an actual problem. I still wouldn't go as far as calling it a "coincidence" that it works on x86, though. An implementation has to really try to break this code. – Peter Cordes Mar 14 '20 at 18:24
  • But yes, it does depend on various assumptions about how a "normal" implementation works, the same assumptions that most lock-free code made before C++11. These assumptions are universally true on real x86 implementations, and I'm pretty sure on other ISAs too. So it's hardly a coincidence, just not strictly language-lawyer compliant. I don't disagree with the point I think you're making, just the language / wording you're using. – Peter Cordes Mar 14 '20 at 18:27