The object might be `const`
It wouldn't be safe for a `static const int val = 1;` living in read-only memory. The unconditional-store version will segfault trying to write to read-only memory. The version that checks first is safe to call on that object in the C++ abstract machine (via `const_cast`), so the optimizer has to respect the possibility that any object that's not written to was originally `const` and in read-only memory.
On a system that silently ignored attempts to write a read-only address, or one that only had read+write RAM, this wouldn't be a problem. But mainstream non-embedded platforms like x86-64 do have memory protection, and some embedded targets might fault on an attempt to store to ROM. It would still be C++ UB to write a `const` object in the abstract machine, but the compiler could in theory invent writes of the value already present when generating asm for a system where it wouldn't fault, if other restrictions don't prevent it, and if compiler devs actually wrote and maintained code to spend compile time looking for this optimization, which is unlikely.
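As a concrete sketch (function names are mine, not from the question): calling the check-first version on a `const` object through a `const_cast` pointer is fine in the abstract machine as long as no write actually happens, while the unconditional version would be UB and a likely segfault:

```cpp
void set1_checked(int *p)       { if (*p != 1) *p = 1; }  // no write if already 1
void set1_unconditional(int *p) { *p = 1; }               // always writes

static const int val = 1;       // may be placed in read-only memory (.rodata)

int main() {
    set1_checked(const_cast<int*>(&val));          // OK: *p == 1, so no store happens
 // set1_unconditional(const_cast<int*>(&val));    // UB: writes a const object; would fault
}
```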
Thread safety (maybe ok in this case)
In general the compiler must not invent writes to objects the abstract machine doesn't write, in case another thread is also writing it and we'd step on the value. e.g. `x.store(x.load())` can reset `x` back to an earlier value, making another thread's `x++` lose counts. (Except atomic RMWs are safe, like a compare-exchange that atomically stores a `0` only if the value was already `0`.)
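A minimal sketch of that lost-update hazard (thread functions and the `x` object are assumptions for illustration):

```cpp
#include <atomic>

std::atomic<int> x{0};

// Thread A: load, then store back what was loaded. Thread B's increment
// can land between A's load and store, and A's store of the stale value
// then overwrites it: the increment is lost.
void thread_a() { x.store(x.load()); }
void thread_b() { ++x; }   // atomic RMW, but A's store-back can still overwrite its result

// Safe form: an atomic RMW, e.g. a compare-exchange that stores 0 only if
// the value was already 0. The comparison and store are one atomic step,
// so it can never replace another thread's update with a stale value.
void thread_a_safe() {
    int expected = 0;
    x.compare_exchange_strong(expected, 0);
}
```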
Since we already read the object (and there's nothing between the read and the potential write that another thread could synchronize with), we can assume no other thread writes it: that would be data-race UB with our unconditional read.
And in this case, seeing any other value would result in a store, so any stepping on values stored by other threads could equally well be explained by the `if` running after the other store, seeing a non-`1` value, then deciding to store a `1`. (Unless there's a possible memory-ordering problem for an unconditional store? I think probably not in a race-free program, especially not when compiling for x86 with its strongly ordered memory model.)
I think thread-safety is not a real problem for this one case, assuming storing an `int` is done atomically¹ in the asm, so other threads will all read `1` in the no-UB case where there aren't any other writers that could overlap with this function's execution.
But in general, inventing a non-atomic load + store-back of the same value has been a thread-safety problem for compilers in practice (e.g. I seem to recall reading that IA-64 GCC did that for bytes just past the end of an array for an odd-length `memcpy` or bitfield or something, which was bad news when it was in a struct next to a `uint8_t lock`.) So compiler devs are justifiably reluctant to invent stores.
- Crash with icc: can the compiler invent writes where none existed in the abstract machine? - a real case of ICC inventing writes when auto-vectorizing (for a more normal conditional-replacement loop), leading to crashes on string literals, as well as thread unsafety. This is/was a compiler bug, and the kind of problem that's solved by AVX-512 masked stores. (Or by writing the source like `arr[i] = arr[i] == x ? new_val : arr[i];` to unconditionally store something, in which case you of course can't call it on read-only memory, and it lets the compiler know it doesn't have to worry about avoiding non-atomic RMWs in case of other threads. It can optimize away the stores by masking, but it can't invent new stores. See the sketch after this list.)
- https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/ part 2 of his talk mentions some real-world compiler bugs that have since been fixed, where compilers did invent writes in ways that violated the C++11 memory model and led to problems, like the IA-64 one I mentioned.
- LWN: Who's afraid of a big bad optimizing compiler? - an inventory of the things compilers can do on non-atomic non-volatile accesses, which could be a problem for rolling your own atomics (like the Linux kernel does) if you tried to skip `volatile` for the accesses. Invented stores are possible only for code paths that already definitely store to the object, but invented loads are always possible for actual objects or C++ references, although not pointer derefs. (C++ references aren't nullable, and I think can only be taken on valid objects, unlike pointers to one-past-the-end of an array. But John Bollinger points out that references can outlive the original object, becoming stale. In cases where that's a possibility, an invented load isn't safe if the pointed-to memory might have been unmapped by a system call.)
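To illustrate the ICC case from the first bullet, a sketch (function and variable names are mine) of a conditional-replacement loop and its unconditional-store rewrite:

```cpp
#include <cstddef>

// Conditional-replacement loop: only stores on a match, so it's legal to
// call on read-only data that contains no matches. A compiler must not
// auto-vectorize this with unconditional stores; that invents writes.
void replace(int *arr, size_t n, int old_val, int new_val) {
    for (size_t i = 0; i < n; i++)
        if (arr[i] == old_val)
            arr[i] = new_val;
}

// Unconditional-store rewrite: every element is written in the abstract
// machine, so plain (unmasked) vector stores are fine. In exchange, it's
// never callable on read-only memory, and the compiler knows it doesn't
// have to avoid non-atomic load + store-back in case of other threads.
void replace_always_store(int *arr, size_t n, int old_val, int new_val) {
    for (size_t i = 0; i < n; i++)
        arr[i] = (arr[i] == old_val) ? new_val : arr[i];
}
```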
Note 1: atomicity
An atomic store is trivial in asm for most C++ implementations, which require `int` to be aligned and run on machines with registers and internal data paths at least as wide as `int`. But atomicity isn't actually necessary in this case: storing 1 byte at a time is also fine. If there are no other writers, rewriting each byte with the value that was already there doesn't ever change the value. If there are other writers, there was UB in the C++ abstract machine, and we're just changing the symptoms, e.g. that the final result is `0x00ffffff` if another thread stored `-1` after we stored 3 of 4 bytes.
What would be a problem is temporarily leaving a different value in memory, e.g. by clearing the whole thing to zero and then setting the low bit, as supercat suggested. That would be allowed for an assignment that really happened in the abstract machine. (But probably only plausible for a Deathstation 9000 compiler that's intentionally hostile and breaks code with UB in as many cases as possible. The opposite of real compilers that are designed with some thought to systems / kernel programming and hand-rolled atomics like the Linux kernel uses.)
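To make the distinction concrete, a sketch of what a compiler could emit, in effect (hypothetical code, assuming a little-endian machine with 4-byte `int`; not something you'd write by hand):

```cpp
// Harmless to invent when there are no other writers: each byte is
// rewritten with the value it already holds, so no different value is
// ever visible in memory.
void store_same_value_bytewise(unsigned char *p) {  // p -> an int currently holding 1
    p[0] = p[0]; p[1] = p[1]; p[2] = p[2]; p[3] = p[3];
}

// Problematic to invent: a concurrent reader could observe the transient 0.
// (Allowed only for an assignment that really happens in the abstract
// machine, where a concurrent reader would already be data-race UB.)
void store_via_transient_zero(int *p) {
    *p = 0;                                    // transient wrong value
    *reinterpret_cast<unsigned char*>(p) = 1;  // then set the low byte (little-endian)
}
```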
Since the C++ variable isn't `atomic<>`, we can't be breaking a release-sequence headed by a different thread. Nothing can legally do an `acquire` load on a plain `int` in ISO C++. But in GNU C++, they could with `__atomic_load_n(&x, __ATOMIC_ACQUIRE)`.
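For reference, that GNU builtin looks like this (variable and function names are mine):

```cpp
int x;  // plain int, not std::atomic<int>

int observe() {
    // GCC/Clang builtin: an acquire load on a non-atomic object,
    // something ISO C++ has no way to express.
    return __atomic_load_n(&x, __ATOMIC_ACQUIRE);
}
```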
Performance reasons for respecting the source-code choice
If many threads are running this code on the same object, unconditional writes would be safe on normal CPU architectures, but much slower (contention for MESI exclusive ownership of the cache line, vs. shared). And it's only safe because they're all storing the same value; if even one thread was storing a different value, its store might get overwritten if it happened not to be last in the modification order, as determined by the order in which CPUs get ownership of the cache line to commit their stores.

Dirtying a cache line is also something that might not be desirable.
This check-before-write idiom is actually a real thing that some multithreaded code will do to avoid cache-line ping-pong on variables that would be highly contended if every thread wrote the value that's already there.
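A minimal sketch of the idiom (the `flag` object and the choice of memory orders are assumptions for illustration):

```cpp
#include <atomic>

// Many threads may call this concurrently. The relaxed load usually hits
// the cache line in MESI Shared state (no ping-pong); only a rare mismatch
// pays for a Read-For-Ownership to commit the store.
void set_flag_if_needed(std::atomic<int> &flag) {
    if (flag.load(std::memory_order_relaxed) != 1)
        flag.store(1, std::memory_order_relaxed);
}
```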
Also related CPU-architecture considerations:
- How does x86 handle store conditional instructions? (It doesn't, except with AVX or AVX-512 masked stores. This isn't very related, since you'd still have to read first to generate a condition. x86 `cmpxchg` does an `==` compare, not `!=`. And using `lock cmpxchg` to get an atomic RMW, to be sure of no thread-safety problems, would always dirty the cache line.)
- What specifically marks an x86 cache line as dirty - any write, or is an explicit change required? - silent-store optimizations in hardware could be the best of both worlds, perhaps not even requiring the cache to get exclusive ownership of the line when software does an unconditional store of the value that's already there. But no CPUs actually do silent-store optimization to L1d that I know of. Some that did it for L3 (Skylake and Ice Lake, for stores of all-zero cache lines) have disabled it in microcode because of the potential for a data-dependent-timing side-channel, unfortunately.
- Locks around memory manipulation via inline assembly - and discussion on Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock? re: the fact that a pure read followed by a write might result in two off-core requests for the cache line: one to get it in MESI Shared state, and then a Read-For-Ownership (RFO) to get exclusive ownership. It's the same problem as with a spinlock or mutex if you start pessimistic and try not to disturb other cores as much by checking read-only first before trying a `lock cmpxchg` or `xchg`.
In this case, if you don't expect to avoid the store much of the time, you should just do it unconditionally so there's only the RFO from the write request, not an earlier share request. (This also avoids possible branch mispredicts, or on 32-bit ARM where a predicated store is possible, avoids stalling waiting for the load. The store buffer can decouple execution from cache-miss stores committing to cache.)