The object might be `const`
It wouldn't be safe for a `static const int val = 1;` living in read-only memory. The unconditional-store version will segfault trying to write to read-only memory. The version that checks first is safe to call on that object in the C++ abstract machine (via `const_cast`), so the optimizer has to respect the possibility that any object that's not written to was originally `const` and in read-only memory.
On a system that silently ignored attempts to write a read-only address, or one that only had read+write RAM, this wouldn't be a problem. But mainstream non-embedded platforms like x86-64 do have memory protection, and some embedded targets might fault on an attempt to store to ROM. It would still be C++ UB to write a `const` object in the abstract machine, but the compiler could in theory invent writes of the value already present when generating asm for a system where it wouldn't fault, if other restrictions don't prevent it, and if compiler devs actually wrote and maintained code to spend compile time looking for this optimization, which is unlikely.
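As a concrete sketch (function names are mine, not from the question): calling the check-first version on a `const` object through a `const_cast` pointer is fine in the abstract machine as long as no write actually happens, while the unconditional version would be UB and a likely segfault:

```cpp
void set1_checked(int *p)       { if (*p != 1) *p = 1; }  // no write if already 1
void set1_unconditional(int *p) { *p = 1; }               // always writes

static const int val = 1;       // may be placed in read-only memory (.rodata)

int main() {
    set1_checked(const_cast<int*>(&val));          // OK: *p == 1, so no store happens
 // set1_unconditional(const_cast<int*>(&val));    // UB: writes a const object; would fault
}
```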
Thread safety (maybe ok in this case)
In general the compiler must not invent writes to objects the abstract machine doesn't write, in case another thread is also writing it and we'd step on the value. e.g. `x.store(x.load())` can reset `x` back to an earlier value, making another thread's `x++` lose counts. (Except atomic RMWs are safe, like a compare-exchange that atomically stores a `0` only if the value was already `0`.)
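A minimal sketch of that lost-update hazard (thread functions and the `x` object are assumptions for illustration):

```cpp
#include <atomic>

std::atomic<int> x{0};

// Thread A: load, then store back what was loaded. Thread B's increment
// can land between A's load and store, and A's store of the stale value
// then overwrites it: the increment is lost.
void thread_a() { x.store(x.load()); }
void thread_b() { ++x; }   // atomic RMW, but A's store-back can still overwrite its result

// Safe form: an atomic RMW, e.g. a compare-exchange that stores 0 only if
// the value was already 0. The comparison and store are one atomic step,
// so it can never replace another thread's update with a stale value.
void thread_a_safe() {
    int expected = 0;
    x.compare_exchange_strong(expected, 0);
}
```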
Since we already read the object (and there's nothing between the read and the potential write that another thread could synchronize with), we can assume no other thread writes it: that would be data-race UB with our unconditional read.
And in this case, seeing any other value would result in a store, so any stepping on values stored by other threads could equally well be explained by the `if` running after the other store, seeing a non-`1` value, then deciding to store a `1`. (Unless there's a possible memory-ordering problem for an unconditional store? I think probably not in a race-free program, especially not when compiling for x86 with its strongly ordered memory model.)
I think thread-safety is not a real problem for this one case, assuming storing an `int` is done atomically¹ in the asm, so other threads will all read `1` in the no-UB case where there aren't any other writers that could overlap with this function's execution.
But in general, inventing a non-atomic load + store-back of the same value has been a thread-safety problem for compilers in practice (e.g. I seem to recall reading that IA-64 GCC did that for bytes just past the end of an array for an odd-length `memcpy` or bitfield or something, which was bad news when it was in a struct next to a `uint8_t lock`.) So compiler devs are justifiably reluctant to invent stores.
- Crash with icc: can the compiler invent writes where none existed in the abstract machine? - a real case of ICC inventing writes when auto-vectorizing (for a more normal conditional-replacement loop), leading to crashes on string literals, as well as thread unsafety. This is/was a compiler bug, and the kind of problem that's solved by AVX-512 masked stores. (Or by writing the source like `arr[i] = arr[i] == x ? new_val : arr[i];` to unconditionally store something, in which case you of course can't call it on read-only memory, and it lets the compiler know it doesn't have to worry about avoiding non-atomic RMWs in case of other threads. It can optimize away the stores by masking, but it can't invent new stores. See the sketch after this list.)
- https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/ part 2 of his talk mentions some real-world compiler bugs that have since been fixed, where compilers did invent writes in ways that violated the C++11 memory model and led to problems, like the IA-64 one I mentioned.
- LWN: Who's afraid of a big bad optimizing compiler? - an inventory of the things compilers can do on non-atomic non-volatile accesses, which could be a problem for rolling your own atomics (like the Linux kernel does) if you tried to skip `volatile` for the accesses. Invented stores are possible only for code paths that already definitely store to the object, but invented loads are always possible for actual objects or C++ references, although not pointer derefs. (C++ references aren't nullable, and I think can only be taken on valid objects, unlike pointers to one-past-the-end of an array. But John Bollinger points out that references can outlive the original object, becoming stale. In cases where that's a possibility, an invented load isn't safe if the pointed-to memory might have been unmapped by a system call.)
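To illustrate the ICC case from the first bullet, a sketch (function and variable names are mine) of a conditional-replacement loop and its unconditional-store rewrite:

```cpp
#include <cstddef>

// Conditional-replacement loop: only stores on a match, so it's legal to
// call on read-only data that contains no matches. A compiler must not
// auto-vectorize this with unconditional stores; that invents writes.
void replace(int *arr, size_t n, int old_val, int new_val) {
    for (size_t i = 0; i < n; i++)
        if (arr[i] == old_val)
            arr[i] = new_val;
}

// Unconditional-store rewrite: every element is written in the abstract
// machine, so plain (unmasked) vector stores are fine. In exchange, it's
// never callable on read-only memory, and the compiler knows it doesn't
// have to avoid non-atomic load + store-back in case of other threads.
void replace_always_store(int *arr, size_t n, int old_val, int new_val) {
    for (size_t i = 0; i < n; i++)
        arr[i] = (arr[i] == old_val) ? new_val : arr[i];
}
```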
Note 1: atomicity
An atomic store is trivial in asm for most C++ implementations, which require `int` to be aligned and run on machines with registers and internal data paths at least as wide as `int`. But atomicity isn't actually necessary in this case: storing 1 byte at a time is also fine. If there are no other writers, rewriting each byte with the value that was already there doesn't ever change the value. If there are other writers, there was UB in the C++ abstract machine, and we're just changing the symptoms, e.g. that the final result is `0x00ffffff` if another thread stored `-1` after we stored 3 of 4 bytes.
What would be a problem is temporarily leaving a different value in memory, e.g. by clearing the whole thing to zero and then setting the low bit, as supercat suggested. That would be allowed for an assignment that really happened in the abstract machine. (But probably only plausible for a Deathstation 9000 compiler that's intentionally hostile and breaks code with UB in as many cases as possible. The opposite of real compilers that are designed with some thought to systems / kernel programming and hand-rolled atomics like the Linux kernel uses.)
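To make the distinction concrete, a sketch of what a compiler could emit, in effect (hypothetical code, assuming a little-endian machine with 4-byte `int`; not something you'd write by hand):

```cpp
// Harmless to invent when there are no other writers: each byte is
// rewritten with the value it already holds, so no different value is
// ever visible in memory.
void store_same_value_bytewise(unsigned char *p) {  // p -> an int currently holding 1
    p[0] = p[0]; p[1] = p[1]; p[2] = p[2]; p[3] = p[3];
}

// Problematic to invent: a concurrent reader could observe the transient 0.
// (Allowed only for an assignment that really happens in the abstract
// machine, where a concurrent reader would already be data-race UB.)
void store_via_transient_zero(int *p) {
    *p = 0;                                    // transient wrong value
    *reinterpret_cast<unsigned char*>(p) = 1;  // then set the low byte (little-endian)
}
```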
Since the C++ variable isn't `atomic<>`, we can't be breaking a release-sequence headed by a different thread. Nothing can legally do an `acquire` load on a plain `int` in ISO C++. But in GNU C++, they could with `__atomic_load_n(&x, __ATOMIC_ACQUIRE)`.
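For reference, that GNU builtin looks like this (variable and function names are mine):

```cpp
int x;  // plain int, not std::atomic<int>

int observe() {
    // GCC/Clang builtin: an acquire load on a non-atomic object,
    // something ISO C++ has no way to express.
    return __atomic_load_n(&x, __ATOMIC_ACQUIRE);
}
```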
Performance reasons for respecting the source-code choice
If many threads are running this code on the same object, unconditional writes would be safe on normal CPU architectures, but much slower (contention for MESI exclusive ownership of the cache line, vs. shared). And it's only safe because they're all storing the same value; if even one thread was storing a different value, its store might get overwritten if it happened not to be last in the modification order, as determined by the order in which CPUs get ownership of the cache line to commit their stores.

Dirtying a cache line is also something that might not be desirable.
This check-before-write idiom is actually a real thing that some multithreaded code will do to avoid cache-line ping-pong on variables that would be highly contended if every thread wrote the value that's already there.
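A minimal sketch of the idiom (the `flag` object and the choice of memory orders are assumptions for illustration):

```cpp
#include <atomic>

// Many threads may call this concurrently. The relaxed load usually hits
// the cache line in MESI Shared state (no ping-pong); only a rare mismatch
// pays for a Read-For-Ownership to commit the store.
void set_flag_if_needed(std::atomic<int> &flag) {
    if (flag.load(std::memory_order_relaxed) != 1)
        flag.store(1, std::memory_order_relaxed);
}
```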
Also related CPU-architecture considerations:
- How does x86 handle store conditional instructions? (It doesn't, except with AVX or AVX-512 masked stores. This isn't very related, since you'd still have to read first to generate a condition. x86 `cmpxchg` does an `==` compare, not `!=`. And using `lock cmpxchg` to get an atomic RMW, to be sure of no thread-safety problems, would always dirty the cache line.)
- What specifically marks an x86 cache line as dirty - any write, or is an explicit change required? - silent-store optimizations in hardware could be the best of both worlds, perhaps not even requiring the cache to get exclusive ownership of the line when software does an unconditional store of the value that's already there. But no CPUs actually do silent-store optimization to L1d that I know of. Some that did it for L3 (Skylake and Ice Lake, for stores of all-zero cache lines) have disabled it in microcode because of the potential for a data-dependent-timing side-channel, unfortunately.
- Locks around memory manipulation via inline assembly - and discussion on Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock? re: the fact that a pure read followed by a write might result in two off-core requests for the cache line: one to get it in MESI Shared state, and then a Read-For-Ownership (RFO) to get exclusive ownership. It's the same problem as with a spinlock or mutex if you start pessimistic and try not to disturb other cores as much by checking read-only first before trying a `lock cmpxchg` or `xchg`.
In this case, if you don't expect to avoid the store much of the time, you should just do it unconditionally so there's only the RFO from the write request, not an earlier share request. (This also avoids possible branch mispredicts, or on 32-bit ARM where a predicated store is possible, avoids stalling waiting for the load. The store buffer can decouple execution from cache-miss stores committing to cache.)