
Assume we have the following code:

int test(bool* flag, int* y)
{
    if(*y)
    {
        *flag = false;
    }
    else
    {
        *flag = true;
    }
}

Note that the compiler can prove here that the write to *flag will always happen, so I think the following is allowed (which I don't think is an optimization at all, but it's just for the example):

int test(bool* flag, int* y)
{
    *flag = true;
    if(*y)
    {
        *flag = false;
    }
}

So now we write true to *flag even when *y != 0, but from the point of view of the as-if rule, this looks valid.

But I still think this optimization is weird. Assume that *y is always nonzero, so the flag is always false; if some other thread reads the flag variable, it may see true, although that should never happen. So does this break the as-if rule? Is the optimization valid?

Addition: the non-atomic case is clear, since it's UB and all bets are off. But if the flag is atomic with relaxed ordering, what then?
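
For concreteness, here is a minimal sketch of the scenario I mean, assuming the flag is a std::atomic<bool> accessed with relaxed operations (the writer/reader split and all the names are just for illustration):

#include <atomic>
#include <thread>

std::atomic<bool> flag{false};
int y = 1; // assume y is always nonzero

void writer()
{
    // original logic: with y != 0, the only value ever stored is false
    if (y)
        flag.store(false, std::memory_order_relaxed);
    else
        flag.store(true, std::memory_order_relaxed);
}

void reader()
{
    // if the compiler first stored true and then false, this load could
    // observe true, a value the abstract machine never stores here
    bool seen = flag.load(std::memory_order_relaxed);
    (void)seen;
}

int main()
{
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}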

Moshe Levy
    @Moshe Levy The code can be rewritten like *flag = !*y; – Vlad from Moscow May 01 '23 at 19:53
  • let's assume flag is atomic with store/load with relaxed ordering then. – Moshe Levy May 01 '23 at 19:53
  • @MosheLevy Then you have a completely different question. – user17732522 May 01 '23 at 19:56
  • @user17732522 actually it's kind of pseudo-code, I meant to ask about both; you answered the non-atomic case, thanks for that, I will clarify it. – Moshe Levy May 01 '23 at 19:59
  • If the compiler supports aliasing (as GCC and Clang do with `-fno-strict-aliasing`) or `bool` is a `typedef` for `int` (rather than the macro defined by `<stdbool.h>`), then the compiler must allow for the possibility that `flag` and `y` point to the same memory, so the transformation would not be allowed. – Eric Postpischil May 01 '23 at 19:59
  • @EricPostpischil: Interesting point about reordering wrt. to the load. A compiler could still invent an extra store, though, like `tmp = *y;` then store / branch over another store. https://lwn.net/Articles/793253/#Invented%20Stores (*Who's afraid of a big bad optimizing compiler?*) - this is of course why we need `std::atomic<>` (or `volatile` for hand-rolled atomics like the Linux kernel uses). – Peter Cordes May 01 '23 at 20:05
  • meant to flag.. – Moshe Levy May 01 '23 at 20:21

3 Answers

9

The transformation is valid (based only on what the standard(s) define, i.e. assuming strict aliasing).

It is impossible for any other thread to observe the intermediate value, because no thread is allowed to read *flag while the function executes on another thread.

*flag isn't an atomic object, the function always writes to *flag, and another thread reading it unsynchronized therefore causes undefined behavior: it is a data race regardless of the path taken.


If flag had type (std::)atomic_bool* instead, then, regardless of memory ordering used, the transformation wouldn't generally be valid, because, as you said, it would become possible for another thread to observe a value that it shouldn't have been able to observe.
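
For illustration, a minimal sketch of the atomic case, assuming relaxed ordering (the function names are made up):

#include <atomic>

// The atomic counterpart of the original function.
void test_atomic(std::atomic_bool* flag, int* y)
{
    if (*y)
        flag->store(false, std::memory_order_relaxed);
    else
        flag->store(true, std::memory_order_relaxed);
}

// Not a valid rewrite: a concurrent relaxed load of *flag could now observe
// the intermediate true even when *y is nonzero, i.e. a value the abstract
// machine never stores.
void test_atomic_hoisted(std::atomic_bool* flag, int* y)
{
    flag->store(true, std::memory_order_relaxed);
    if (*y)
        flag->store(false, std::memory_order_relaxed);
}

// Still a valid rewrite: it performs exactly one store of the same value the
// abstract machine would have stored.
void test_atomic_single_store(std::atomic_bool* flag, int* y)
{
    flag->store(*y == 0, std::memory_order_relaxed);
}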

user17732522
1

Given:

int test(bool* flag, int* y) {
  if(*y)

The following (slightly different from OP's code) is not allowed

// int test(bool* flag, int* y) {
int test(bool* flag, bool* y) {
// or 
int test(int* flag, int* y) {
  *flag = true;

... as pointers to the same type of data may point to the same piece of data, i.e. the pointers may be equal. In that case the code change is not functionally the same.

If code was:

int test(same_type* restrict flag, same_type* restrict y) {

Then the compiler can assume that the pointers do not point to overlapping data, even though they point to the same type.
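
As a concrete sketch (assuming both parameters are `int*`; the function names are made up), let the pointers alias an object that starts at 0; the original and hoisted versions then produce different results:

#include <cassert>

// Same logic as the question, but both pointers have the same type and may alias.
void test_same_type(int* flag, int* y)
{
    if (*y)
        *flag = 0;
    else
        *flag = 1;
}

// The proposed transformation with the unconditional store hoisted up.
void test_same_type_hoisted(int* flag, int* y)
{
    *flag = 1;
    if (*y)
        *flag = 0;
}

int main()
{
    int x = 0;
    test_same_type(&x, &x);          // *y is 0, so the else branch stores 1
    assert(x == 1);

    int z = 0;
    test_same_type_hoisted(&z, &z);  // the hoisted store makes *y nonzero,
                                     // so the second store puts 0 back
    assert(z == 0);

    // Different observable results, so the compiler may not hoist the store
    // when the pointers can alias.
}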

chux - Reinstate Monica
  • thanks, I didn't think of this case, but the main idea of the question assumes that's not the case: they do not point to the same memory – Moshe Levy May 01 '23 at 20:22
  • @MosheLevy Even if in your use case the pointers do not point to overlapping data, the compiler can only assume non-overlap when the types differ or with `restrict`. Since the types differ in `int test(bool* flag, int* y)`, the compiler is free to re-org as proposed. – chux - Reinstate Monica May 01 '23 at 20:26
  • @MosheLevy: regardless of aliasing, the compiler could still invent a store to an object that's already being stored to unconditionally. That's separate from reordering it with the load of `*y`. In this case, a C representation of possible asm is `int tmp = *y;` `*flag = true;` `if (tmp) *flag = false;`. See https://lwn.net/Articles/793253/#Invented%20Stores (*Who's afraid of a big bad optimizing compiler?*) - this is of course why we need `std::atomic<>` (or `volatile` for hand-rolled atomics like the Linux kernel uses). Cases like this are one reason you can't safely use non-atomic non-volatile objects for inter-thread communication. – Peter Cordes May 01 '23 at 20:59
  • Also, the compiler could branch on `flag == y`, and when they are unequal, execute a version that invents any number of loads and stores in any order. – Nate Eldredge May 01 '23 at 23:13
  • @PeterCordes thanks, so using volatile on the flag would be enough, since this transformation would break the ordering of volatile accesses, am I right? – Moshe Levy May 02 '23 at 05:45
  • @MosheLevy: You could do that, but it doesn't gain you anything vs. `std::atomic` with `memory_order_relaxed`. And worse, it still has data-race undefined behaviour according to ISO C++. [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) (never, unless you're writing Linux kernel code.) If you want to do unsynchronized access to a shared variable from multiple threads, it should be `atomic<>`. – Peter Cordes May 02 '23 at 05:50
0
int test(bool* flag, int* y)
{
    *flag = true;

As-is, it is dubious whether this is an allowed optimization. The compiler is not allowed to optimize out or reorder what the standard calls side effects, defined in C17 5.1.2.3:

Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects ...
/--/
The presence of a sequence point between the evaluation of expressions A and B implies that every value computation and side effect associated with A is sequenced before every value computation and side effect associated with B.

In this case, an lvalue access to the object pointed at by flag modifies an object - it is a side effect. However, reading from *y is not a side effect, but writing to the same variable twice (*flag = true; ... *flag = false;) would introduce a new side effect which wasn't there in the code.

But this is an artificial example. An optimizing compiler wouldn't do such optimizations on a "C level". Instead it would likely use a CPU register as a temporary variable, in case freeing one up while setting it to zero would somehow be more efficient. So it wouldn't write to the actual location flag but to a temporary register.

Even more likely, it would set a register to the value of the boolean expression *y != 0 then copy that register value 1 or 0 into *flag. Notably if I change your C code to this:

int test(bool* flag, int* y)
{
  *flag = *y != 0;
}

Then I get the very same machine code, given that optimizations are enabled. On gcc x86-64 with -O3 it might look like:

mov     eax, DWORD PTR [rsi]
test    eax, eax
setne   BYTE PTR [rdi]
ret

That is, copy *y into register eax, then test sets the status flags depending on whether eax is zero or not, and setne stores the result 1 or 0 into *flag based on that.

The main optimization to consider here is likely to generate "branch free" code, without any comparisons or conditional jumps etc, since that leads to efficient use of instruction cache on high-end CPUs such as x86.


Regarding multi-threading, none of this code is safe regardless of re-ordering, because a write to a C variable cannot be considered atomic unless explicitly qualified as such (_Atomic etc).


Regarding strict aliasing as brought up in other answers, it isn't really relevant here. A bool* and an int* are non-compatible pointer types and may not alias. The compiler may assume that a write to *flag will not change *y, or vice versa.

Lundin
  • thanks, what exactly is the side effect: writing to flag a value that we would never write without the optimization, or creating an additional write to flag? – Moshe Levy May 02 '23 at 08:54
  • @MosheLevy `*flag = ...` _in the original C code_ is a side effect. Compilers do not optimize code by reorganizing C code and that's the main flaw with the reasoning here. And besides I don't see how two writes to the same memory location could generate faster code on any CPU, so it is highly unlikely that some compiler will ever do this. – Lundin May 02 '23 at 08:59
  • this is just an example, I'm not referring to any real optimization or to why it would be an optimization at all. So what you basically mean is: the compiler creates a new side effect here (a write to a shared data value) that in some cases wasn't in the first code, although the compiler is not allowed to create side effects while it does the optimization, am I right? – Moshe Levy May 02 '23 at 09:22
  • moreover, in this link https://lwn.net/ml/linux-kernel/20210604205600.GB4397@paulmck-ThinkPad-P17-Gen-1/ we see that pretty much the same example is raised by the Linux kernel folks, what do you think about that? – Moshe Levy May 02 '23 at 09:23
  • @MosheLevy That thread is about volatile object access so it isn't the same thing. Though notably there's been a recognized defect in the C language in the part I quoted, namely "Accessing a volatile object". Some bad compilers took this literally and optimized expressions that weren't accessing a named object, even though `volatile` was present - such as `( *(volatile uint8_t*)some_address ) = x;`. This has been fixed with upcoming C23, but until then beware. – Lundin May 02 '23 at 09:44
  • I know, but they still raised an example where the compiler decided to move the store out of the condition on y; can't we say that example is not allowed, since it means creating a side effect? – Moshe Levy May 02 '23 at 09:49
  • @MosheLevy It's often debated. Reordering can actually happen on the CPU side and then the compiler is helpless, short of introducing memory barriers. But neither the compiler nor CPU are allowed to re-order instructions past a volatile access and here "volatile access" and side effect are different, because C says "Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine" but that text doesn't mention side effects. Generally, this part of the C language is under-specified and poorly worded. – Lundin May 02 '23 at 10:07
  • Your reasoning about modifications to non-volatile is ignoring the as-if rule. Compilers are allowed to invent extra stores to an object that's definitely being stored to in the abstract machine. It's not `volatile` so the side-effect isn't a guaranteed externally-visible property (in the asm or with a watchpoint), and it would be data-race UB to observe it from another thread within the C++ program. Of course in this case real compilers usually *won't* want to implement it that way, as you say if-conversion to branchless is likely. But that's separate from whether they're allowed. – Peter Cordes May 02 '23 at 13:26
  • @PeterCordes "Compilers are allowed to invent extra stores to an object that's definitely being stored to in the abstract machine" There's no clear section in the standard confirming/rejecting that statement. It only says _"An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced"_. Well there is a side effect in this code and it is getting evaluated. But nothing confirms or denies that the compiler/CPU is allowed to _re-order_ the execution of that side effect or add a second one etc. – Lundin May 02 '23 at 13:35
  • @Lundin from what you wrote, each write to some shared memory is a side effect, so for the following code, x=2; x=3;, a simple optimization would be just x=3; then why is the compiler allowed not to evaluate x=2, isn't that a side effect? – Moshe Levy May 02 '23 at 13:53
  • @MosheLevy That might sort under the "value not used" rule. – Lundin May 02 '23 at 13:54
  • @Lundin but we have an "and" there, so we must also make sure that no needed side effects are produced. – Moshe Levy May 02 '23 at 13:55
  • @PeterCordes but it's written explicitly that "modifying an object is a side effect"; what do I miss here? – Moshe Levy May 02 '23 at 13:57
  • @MosheLevy: Modifying an object is a side effect, but not a guaranteed-visible side effect. Your example of dead-store elimination is a good one. The section Lundin quoted is about the sequencing rules; even side effects that aren't guaranteed to be visible (after optimization per the as-if rule) need to have sequencing rules to determine what a program actually does. The lack of sequencing of side-effects in `tmp = ++i + ++i` is a classic example of undefined behaviour, because the abstract machine doesn't define an order for the writes (side effects) vs. the reads. – Peter Cordes May 02 '23 at 15:00
  • @MosheLevy: Also, `putchar('a') + putchar('b')` doesn't guarantee which order those side-effects happen. But unlike modifying a non-volatile object, putchar is a *visible* side effect, visible from outside the thread. That's why both putchar calls have to happen (or an equivalent fputs of "ab" or "ba"). The as-if rule does allow a compiler to implement `a+=2` as two increments by 1, for example, because a non-volatile variable `a` can't be observed (without data-race UB) at the same time this thread is modifying it. And a program with UB can do literally anything. – Peter Cordes May 02 '23 at 15:04
  • @PeterCordes so we have that writing to memory is a side effect, and we also have this: "An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced". From that, in order to do an optimization we must make sure there is no needed side effect, so how is the x=3 example from above allowed? Or does it mean visible side effects only? – Moshe Levy May 02 '23 at 16:36
  • @MosheLevy: *so we have that writing to memory is a side effect* - Only in the abstract machine. It doesn't have to correspond to a store in asm after optimization, unless the store is to a `volatile` object. Otherwise the only requirement is that the whole program produce the right output. The way normal compilers work, that means a memory location has to have the right value if/when some other code might read it, e.g. a global variable has to have the right value when the compiler calls a function that it didn't or couldn't inline, because that function *might* read the global. – Peter Cordes May 02 '23 at 17:20
  • @MosheLevy: But between interactions with "black box" code (stuff the optimizer can't see), it certainly can keep global variables in registers and do fewer stores. e.g. https://godbolt.org/z/68Tdxd3e6 . Or if it wants, do more stores along some paths of execution where there's at least one store. (It must not invent writes to objects the abstract machine doesn't write at all, that wouldn't be thread-safe. e.g. `if (y) { x = 1;}` can't be optimized to `x = y ? 1 : x;` because that could step on a store from another thread.) – Peter Cordes May 02 '23 at 17:29
  • @MosheLevy: Maybe another way to put it, inventing an extra store to an object in the asm isn't inventing a "side effect" in the C++ abstract machine, it's just implementing it differently per the as-if rule. Lundin's answer is wrong about that. – Peter Cordes May 02 '23 at 17:31
  • @Lundin: Suppose I do `x = val;` where `x` is a type so large that the target machine cannot store it in a single instruction. So necessarily multiple store instructions must be used, and there will be tearing, in that `x` will temporarily contain a value that is neither the previous nor the new value. I presume you do not consider that to be "introducing a new side effect". [...] – Nate Eldredge May 04 '23 at 01:10
  • On the other hand, [here](https://stackoverflow.com/a/71867102/634919) you can see an example where `x` is small enough that it *could* have been written with one (atomic) store instruction, but the compiler decided it was more efficient to use two instructions instead. Isn't that also legitimate? Why should the mere size of the object make a difference? Then, I fail to see any practical difference between using two stores to write the two halves of an object, and the two stores in the example at hand. – Nate Eldredge May 04 '23 at 01:14
  • Or, for another example, x86-32 *can* store a 64-bit object in one instruction (using SSE or x87 registers or `cmpxchg8b`), but it is much more efficient to do it as two 32-bit stores. – Nate Eldredge May 04 '23 at 01:32