
This is a spin-off from a discussion about C# thread safety guarantees.

I had the following presupposition:

in the absence of thread-aware primitives (mutexes, `std::atomic`, etc.; let's exclude `volatile` as well for simplicity), a valid C++ compiler may perform any kind of transformation, including introducing reads from memory (or writes, if it wants to), as long as the semantics of the code in the current thread (that is, its output and, excluded in this question, volatile accesses) remain the same from the current thread's point of view, i.e. disregarding the existence of other threads. The fact that introducing reads/writes may change other threads' behavior (e.g. because those other threads read the data without proper synchronization, or perform other kinds of UB) can be totally ignored by a standard-conforming compiler.

Is this presupposition correct or not? I would expect this to follow from the as-if rule. (I believe it does, but other people seem to disagree with me.) If possible, please include the appropriate normative references.

Vlad
  • That quote is just musings on the as-if rule in multithreaded programs. There's no new information here. – HolyBlackCat Jun 02 '22 at 17:55
  • Other threads "reading without proper synchronization or performing other kinds of UB" is UB. Undefined behavior is undefined. – DevSolar Jun 02 '22 at 18:02
  • See __Threads and data races__ here https://en.cppreference.com/w/cpp/language/memory_model – Richard Critten Jun 02 '22 at 18:47
  • And also __Multi-threaded executions and data races__ [C++ Draft Standard intro.multithread](https://eel.is/c++draft/intro.multithread) and [C++ Draft Standard intro.races-21](https://eel.is/c++draft/intro.multithread#intro.races-21) _"...The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior...."_ – Richard Critten Jun 02 '22 at 18:55
  • @HolyBlackCat: Well, as some people disagree with my presupposition, it's possible that it doesn't just simply follow. Or maybe it does, so the answer should be easy and trivial. – Vlad Jun 02 '22 at 19:51
  • @DevSolar: I see, but how can I draw the formal conclusion "any optimization (including reading the memory locations not in direct correspondence with the program source) is valid"? – Vlad Jun 02 '22 at 19:54
  • @RichardCritten: This doesn't seem to answer the question _directly_. Is there an easy way to prove formally the presupposition in question based on your reference? – Vlad Jun 02 '22 at 19:59
  • __Axioms__: A program with a data race results in Undefined Behaviour. A conforming program may not contain Undefined Behaviour. In a multi-threaded program you need to use synchronisation primitives to avoid data races and Undefined Behaviour. __Theorem__: If there are no synchronisation primitives, either (a) there are no data races, so the optimizer can optimize __as-if__ the program is single-threaded, or (b) there is a data race, which results in Undefined Behaviour, and the program is therefore non-conforming, so the optimizer can do whatever it likes. – Richard Critten Jun 02 '22 at 20:13

1 Answer


Yes. C++ defines data-race UB as potentially concurrent access to a non-atomic object where not all of the accesses are reads. Another recent Q&A quotes the standard, including:

[intro.races]/2 - Two expression evaluations conflict if one of them modifies a memory location ... and the other one reads or modifies the same memory location.

[intro.races]/21 ... The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, ...

Any such data race results in undefined behavior.


That gives the compiler freedom to optimize code in ways that preserve the behaviour of the thread executing a function, but not what other threads (or a debugger) might see if they go looking at things they're not supposed to. (i.e. data race UB means that the order of reading/writing non-atomic variables is not part of the observable behaviour an optimizer has to preserve.)

introducing reads/writes may change other thread's behavior

The as-if rule allows you to invent reads, but no, you can't invent writes to objects this thread didn't already write. That's why `if (a[i] > 10) a[i] = 10;` is different from `a[i] = a[i] > 10 ? 10 : a[i];`.

It's legal for two different threads to write `a[1]` and `a[2]` at the same time, and one thread loading `a[0..3]` and then storing back some modified and some unmodified elements could step on the store by the thread that wrote `a[2]`.

__Crash with icc: can the compiler invent writes where none existed in the abstract machine?__ is a detailed look at a compiler bug where ICC did that when auto-vectorizing with SIMD blends, including links to Herb Sutter's atomic weapons talk where he discusses the fact that compilers must not invent writes.

By contrast, AVX-512 masking and AVX `vmaskmovps` etc. (like ARM SVE and RISC-V vector extensions, I think) do have proper masking with fault suppression, so they can genuinely not store at all to some SIMD elements, without branching.


It's legal to invent atomic RMWs (except without the Modify part), e.g. an 8-byte `lock cmpxchg [rcx], rdx` if you want to modify some of the bytes in that region. But in practice that's more costly than just storing the modified bytes individually, so compilers don't do that.


Of course a function that does unconditionally write `a[2]` can write it multiple times, and with different temporary values, before eventually updating it to the final value. (Probably only a Deathstation 9000 would invent different-valued temporary contents, like turning `a[2] = 3;` into `a[2] = 2; a[2]++;`.)

For more about what compilers can legally do, see Who's afraid of a big bad optimizing compiler? on LWN. The context for that article is Linux kernel development, where they rely on GCC to go beyond the ISO C standard and actually behave in sane ways that make it possible to roll their own atomics with volatile int* and inline asm. It explains many of the practical dangers of reading or writing a non-atomic shared variable.

Peter Cordes
  • Out of pure curiosity, isn't it allowed to create a completely new object, write to it, and abandon it (e.g. on the stack, or even on the heap, provided that the object will be deleted at the end)? Or modify the value of some global object but restore it afterwards? Isn't it all the same from the current thread's point of view? (Maybe the standard mentions this scenario?) – Vlad Jun 03 '22 at 05:39
  • Thank you for the link, it states some problems I was afraid of and some I wasn't aware of. – Vlad Jun 03 '22 at 05:48
  • @Vlad: yes, sure, clang invents a return-value temporary on the stack in debug builds. [Why is 0 moved to stack when using return value?](https://stackoverflow.com/q/31149806) But the "the heap" isn't a monolithic thing in mainstream C++ compilers. Inventing calls to `new`/`delete` could count as a visible side-effect if some other compilation unit has overridden `operator new`. (This is an obstacle for efficient implementation of `std::vector` to use realloc (or a hypothetical try-realloc for non-trivially-copyable types), ... – Peter Cordes Jun 03 '22 at 06:06
  • ... instead of actually doing the stupid standard-mandated behaviour of actually allocating separate space and copying. This is pretty bad for huge `std::vector`s where Linux `mremap` would have been an option, to allocate new contiguous pages, or remap the existing physical pages to a new virtual address where there's room for more pages after. If your vector is gigabytes in size, the resulting page table updates and TLB misses are cheaper than any copying would have been.) – Peter Cordes Jun 03 '22 at 06:07
  • @Vlad: A "C++ implementation" that's suitable for low-level systems programming (which all the major mainstream ones aim to be) must go *way* beyond the ISO C++ standard in terms of defining what counts as visible behaviour when it comes to the standard library. Inventing calls to memcpy and memset is something compilers do in practice, but not things that might result in system calls (like `mmap` or `VirtualAlloc`). – Peter Cordes Jun 03 '22 at 06:09
  • Accepted the answer, but: could you perhaps add the pointers to the standard into the answer for the future reference? – Vlad Jun 07 '22 at 07:59
  • @Vlad: Sure, quoted [intro.races], and added links to some Q&As about actual compiler behaviour when auto-vectorizing code with conditional stores vs. always storing a conditionally-determined value. – Peter Cordes Jun 07 '22 at 18:05