If a RMW operation changes nothing, can it be optimized away, for all memory orders?

Question

In the C/C++ memory model, can a compiler just combine and then remove redundant/NOP atomic modification operations, such as:

x++,
x--;

or even simply

x+=0; // return value is ignored

For an atomic scalar x?

Does that hold for sequential consistency or just weaker memory orders?

(Note: For weaker memory orders that still do something; for relaxed, there is no real question here. EDIT AGAIN: No actually there is a serious question in that special case. See my own answer. Not even relaxed is cleared for removal.)

EDIT:

The question is not about code generation for a particular access: if I wanted to see two lock add generated on Intel for the first example, I would have made x volatile.

The question is whether these C/C++ instructions have any impact whatsoever: can the compiler just filter out and remove these null operations (that are not relaxed order operations), as a sort of source-to-source transformation? (Or abstract tree to abstract tree transformation, perhaps in the compiler "front end".)

EDIT 2:

Summary of the hypotheses:

not all operations are relaxed
nothing is volatile
atomic objects are really potentially accessible by multiple functions and threads (no automatic atomic whose address isn't shared)

Optional hypothesis:

If you want, you may assume that the address of the atomic is not taken, that all accesses are by name, and that all accesses have a property:

1. That no access of that variable, anywhere, has a relaxed load/store element: all load operations should have acquire and all stores should have release (so all RMW should be at least acq_rel).
2. Or, that for those accesses that are relaxed, the access code doesn't read the value for a purpose other than changing it: a relaxed RMW does not conserve the value further (and does not test the value to decide what to do next). In other words, no data or control dependency on the value of the atomic object unless the load has an acquire.
3. Or that all accesses of the atomic are sequentially consistent.

That is I'm especially curious about these (I believe quite common) use cases.

Note: an access is not considered "completely relaxed" even if it's done with a relaxed memory order, when the code makes sure observers have the same memory visibility, so this is considered valid for (1) and (2):

atomic_thread_fence(std::memory_order_release);
x.store(1,std::memory_order_relaxed);

as the memory visibility is at least as good as with just x.store(1,std::memory_order_release);

This is considered valid for (1) and (2):

int v = x.load(std::memory_order_relaxed);
atomic_thread_fence(std::memory_order_acquire);

for the same reason.
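To make the two fence patterns above concrete, here is a minimal sketch that pairs them up. The `Mailbox` type and its members are hypothetical names introduced only for illustration; the sketch assumes one writer thread calling `publish` once and any number of readers calling `try_consume`.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical one-shot mailbox; all names are illustrative only.
struct Mailbox {
    int payload = 0;             // plain, non-atomic data
    std::atomic<int> ready{0};   // flag, accessed only with relaxed operations

    // Writer side: release fence, then relaxed store.
    void publish(int value) {
        assert(ready.load(std::memory_order_relaxed) == 0);  // one-shot
        payload = value;
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(1, std::memory_order_relaxed);
    }

    // Reader side: relaxed load, then acquire fence, and only then a decision.
    // Returns true and fills `out` if the flag was observed set.
    bool try_consume(int& out) {
        int v = ready.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        if (v == 0)
            return false;
        out = payload;           // ordered after the writer's payload store
        return true;
    }
};
```

The decision on `v` is deliberately taken after the acquire fence, so no choice is based on a purely relaxed load.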

This is stupidly, trivially valid for (2) (i is just an int)

i=x.load(std::memory_order_relaxed),i=0; // useless

as no information from a relaxed operation was kept.

This is valid for (2):

(void)x.fetch_add(1, std::memory_order_relaxed);

This is not valid for (2):

if (x.load(std::memory_order_relaxed))
  f();
else
  g();

as a consequential decision was based on a relaxed load, neither is

i += x.fetch_add(1, std::memory_order_release);

Note: (2) covers one of the most common uses of an atomic, the thread safe reference counter. (CORRECTION: It isn't clear that all thread safe counters technically fit the description as acquire can be done only on 0 post decrement, and then a decision was taken based on counter>0 without an acquire; a decision to not do something but still...)
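For reference, this is roughly the reference-counter pattern the note and its correction are talking about, as a sketch under the usual assumptions. `RefCount`, `add_ref` and `drop_ref` are hypothetical names, not from any particular library.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical intrusive reference count; names are illustrative only.
class RefCount {
    std::atomic<int> count_{1};  // the creator holds the first reference
public:
    // Taking a reference: relaxed RMW, the old value is discarded.
    void add_ref() {
        count_.fetch_add(1, std::memory_order_relaxed);
    }

    // Dropping a reference. Returns true when the caller dropped the last
    // one and is now responsible for destroying the shared object.
    bool drop_ref() {
        int old = count_.fetch_sub(1, std::memory_order_release);
        assert(old > 0);         // more drops than references is a bug
        if (old != 1)
            return false;        // "not the last one": decided without acquire
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
};
```

The early `return false` is the case the correction points at: a decision (to do nothing) is taken from the value read by a release-only RMW, with the acquire performed only on the path that reaches zero.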

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/201804/discussion-on-question-by-curiousguy-if-a-rmw-operation-changes-nothing-can-it). — Samuel Liew, Nov 03 '19 at 22:58

Peter Cordes · Answer 1 · 2019-11-02T06:55:31.733

No, definitely not entirely. It's at least a memory barrier within the thread for stronger memory orders.

For mo_relaxed atomics, yes I think it could in theory be optimized away completely, as if it wasn't there in the source. It's equivalent for a thread to simply not be part of a release-sequence it might have been part of.

If you used the result of the fetch_add(0, mo_relaxed), then I think collapsing them together and just doing a load instead of an RMW of 0 might not be exactly equivalent. Barriers in this thread surrounding the relaxed RMW still have an effect on all operations, including ordering the relaxed operation wrt. non-atomic operations. With a load+store tied together as an atomic RMW, things that order stores could order an atomic RMW when they wouldn't have ordered a pure load.

But I don't think any C++ ordering is like that: mo_release stores order earlier loads and stores, and atomic_thread_fence(mo_release) is like an asm StoreStore + LoadStore barrier. (Preshing on fences). So yes, given that any C++-imposed ordering would also apply to a relaxed load equally to a relaxed RMW, I think int tmp = shared.fetch_add(0, mo_relaxed) could be optimized to just a load.
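As a sketch of the two forms being compared (function names are hypothetical, and whether a compiler may really substitute one for the other is the open question here, not something this code settles):

```cpp
#include <atomic>
#include <cassert>

// Reads `shared` with a no-op relaxed RMW, as in the paragraph above.
int read_via_rmw(std::atomic<int>& shared) {
    return shared.fetch_add(0, std::memory_order_relaxed);
}

// What the hypothetical optimization would turn it into: a plain relaxed load.
int read_via_load(const std::atomic<int>& shared) {
    return shared.load(std::memory_order_relaxed);
}

// With no concurrent writer, both forms must return the same value.
void check_same_result(std::atomic<int>& shared) {
    assert(read_via_rmw(shared) == read_via_load(shared));
}
```

On a single thread the two are indistinguishable; any difference could only show up through how other threads' operations are ordered against the RMW versus the load.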

(In practice compilers don't optimize atomics at all, basically treating them all like volatile atomic, even for mo_relaxed. Why don't compilers merge redundant std::atomic writes? and http://wg21.link/n4455 + http://wg21.link/p0062. It's too hard / no mechanism exists to let compilers know when not to.)

But yes, the ISO C++ standard on paper makes no guarantee that other threads can actually observe any given intermediate state.

Thought experiment: Consider a C++ implementation on a single-core cooperative multi-tasking system. It implements std::thread by inserting yield calls where needed to avoid deadlocks, but not between every instruction. Nothing in the standard requires a yield between num++ and num-- to let other threads observe that state.

The as-if rule basically allows a compiler to pick a legal/possible ordering and decide at compile-time that it's what happens every time.

In practice this can create fairness problems if an unlock/re-lock never actually gives other threads the chance to take a lock if --/++ are combined together into just a memory barrier with no modification of the atomic object! This among other things is why compilers don't optimize.

Any stronger ordering for one or both of the operations can begin or be part of a release-sequence that synchronizes-with a reader. A reader that does an acquire load of a release store/RMW Synchronizes-With this thread, and must see all previous effects of this thread as having already happened.

IDK how the reader would know that it was seeing this thread's release-store instead of some previous value, so a real example is probably hard to cook up. At least we could create one without possible UB, e.g. by reading the value of another relaxed atomic variable so we avoid data-race UB if we didn't see this value.

Consider the sequence:

// broken code where optimization could fix it
    memcpy(buf, stuff, sizeof(buf));

    done.store(1, mo_relaxed);       // relaxed: can reorder with memcpy
    done.fetch_add(-1, mo_relaxed);
    done.fetch_add(+1, mo_release);  // release-store publishes the result

This could optimize to just done.store(1, mo_release); which correctly publishes a 1 to the other thread without the risk of the 1 being visible too soon, before the updated buf values.

But it could also optimize just the cancelling pair of RMWs into a fence after the relaxed store, which would still be broken. (And not the optimization's fault.)

// still broken
    memcpy(buf, stuff, sizeof(buf));

    done.store(1, mo_relaxed);       // relaxed: can reorder with memcpy
    atomic_thread_fence(mo_release);

I haven't thought of an example where safe code becomes broken by a plausible optimization of this sort. Of course just removing the pair entirely even when they're seq_cst wouldn't always be safe.

A seq_cst increment and decrement does still create a sort of memory barrier. If they weren't optimized away, it would be impossible for earlier stores to interleave with later loads. To preserve this, compiling for x86 would probably still need to emit mfence.

Of course the obvious thing would be a lock add [x], 0 which does actually do a dummy RMW of the shared object that we did x++/x-- on. But I think the memory barrier alone, not coupled to an access to that actual object or cache line is sufficient.

And of course it has to act as a compile-time memory barrier, blocking compile-time reordering of non-atomic and atomic accesses across it.
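A sketch of that hypothetical transformation at the source level, for the seq_cst case. The function names are made up for illustration, and whether a standalone fence is strictly equivalent under the ISO model is exactly the part I'm speculating about.

```cpp
#include <atomic>
#include <cassert>

// As in the question: two seq_cst RMWs on x that cancel out.
void inc_then_dec(std::atomic<int>& x) {
    x++;  // seq_cst RMW, adds 1
    x--;  // seq_cst RMW, subtracts 1
}

// Hypothetical replacement: keep the ordering, drop the accesses to x.
// The parameter is unused and kept only so both functions share a signature.
void barrier_only(std::atomic<int>&) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Single-threaded sanity check: either version leaves the value of x alone.
template <class F>
void check_value_preserved(std::atomic<int>& x, F f) {
    const int before = x.load();
    f(x);
    assert(x.load() == before);
}
```

The check only shows that the final value is the same; it says nothing about what other threads could observe in between, which is the real subject of the question.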

For acq_rel or weaker fetch_add(0) or cancelling sequence, the run-time memory barrier might happen for free on x86, only needing to restrict compile-time ordering.

See also a section of my answer on Can num++ be atomic for 'int num'?, and in comments on Richard Hodges' answer there. (But note that some of that discussion is confused by arguments about when there are modifications to other objects between the ++ and --. Of course all ordering of this thread's operations implied by the atomics must be preserved.)

As I said, this is all hypothetical and real compilers aren't going to optimize atomics until the dust settles on N4455 / P0062.

"_For mo_relaxed atomics, yes I think it could in theory be optimized away completely_" No, wrong conclusion was suggested in my Q. **In many cases but not in general.** See my own A. — curiousguy, Nov 15 '19 at 20:25

Nicol Bolas · Answer 2 · 2019-11-02T04:23:43.243

The C++ memory model provides four coherence requirements for all atomic accesses to the same atomic object. These requirements apply regardless of the memory order. As stated in a non-normative note:

The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads.

Emphasis added.

Given that both operations are happening to the same atomic variable, and the first definitely happens before the second (due to being sequenced before it), there can be no reordering of these operations. Again, even if relaxed operations are used.

If this pair of operations were removed by a compiler, that would guarantee that no other threads would ever see the incremented value. So the question now becomes whether the standard would require some other thread to be able to see the incremented value.

It does not. Without some way for something to guarantee-ably "happen after" the increment and "happen before" the decrement, there is no guarantee that any operation on any other thread will certainly see the incremented value.

This leaves one question: does the second operation always undo the first? That is, does the decrement undo the increment? That depends on the scalar type in question. ++ and -- are only defined for the pointer and integer specializations of atomic. So we only need to consider those.

For pointers, the decrement undoes the increment. The reason being that the only way incrementing+decrementing a pointer would not result in the same pointer to the same object is if incrementing the pointer was itself UB. That is, if the pointer is invalid, NULL, or is the past-the-end pointer to an object/array. But compilers don't have to consider UB cases since... they're undefined behavior. In all of the cases where incrementing is valid, pointer decrementing must also be valid (or UB, perhaps due to someone freeing the memory or otherwise making the pointer invalid, but again, the compiler doesn't have to care).

For unsigned integers, the decrement always undoes the increment, since wraparound behavior is well-defined for unsigned integers.

That leaves signed integers. C++ usually makes signed integer over/underflow into UB. Fortunately, that's not the case for atomic math; the standard explicitly says:

For signed integer types, arithmetic is defined to use two's complement representation. There are no undefined results.

Wraparound behavior for two's complement atomics works. That means increment/decrement always results in recovering the same value.
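A small sketch of the signed case at the boundary, assuming a two's complement `int` as on mainstream implementations. `bump_and_restore` is a hypothetical helper, and it uses `fetch_add`/`fetch_sub` directly so that all arithmetic happens inside the atomic operations.

```cpp
#include <atomic>
#include <cassert>
#include <climits>

// Increments then decrements x; returns the intermediate value.
// Intended for single-threaded use only (it re-reads x between steps).
int bump_and_restore(std::atomic<int>& x) {
    const int before = x.fetch_add(1);  // stored value wraps, no UB
    const int mid = x.load();
    x.fetch_sub(1);                     // wraps back
    assert(x.load() == before);         // the pair cancels out
    return mid;
}
```

Starting from `INT_MAX`, the intermediate value is `INT_MIN` and the final value is `INT_MAX` again, which is the "always recovers the same value" property relied on above.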

So there does not appear to be anything in the standard which would prevent compilers from removing such operations. Again, regardless of the memory ordering.

Now, if you use non-relaxed memory ordering, the implementation cannot completely remove all traces of the atomics. The actual memory barriers behind those orderings still have to be emitted. But the barriers can be emitted without emitting the actual atomic operations.

score 0 · Accepted Answer · answered Nov 15 '19 at 20:13

In the C/C++ memory model, can a compiler just combine and then remove redundant/NOP atomic modification operations,

No, the removal part is not allowed, at least not in the specific way the question suggests it would be allowed: the intent here is to describe valid source to source transformations, abstract tree to abstract tree, or rather a higher level description of the source code that encodes all the relevant semantic elements that might be needed for later phases of compilation.

The hypothesis is that code generation can be done on the transformed program, without ever checking with the original one. So only safe transformations that cannot break any code are allowed.

(Note: For weaker memory orders that still do something; for relaxed, there is no real question here.)

No. Even that is wrong: for even relaxed operations, unconditional removal isn't a valid transformation (although in most practical cases it's certainly valid, but mostly correct is still wrong, and "true in >99% practical cases" has nothing to do with the question):

Before the introduction of standard threads, a stuck program was one in an infinite loop performing no externally visible side effects: no input, no output, no volatile operation and in practice no system call. A program that will never perform anything visible is stuck and its behavior is not defined, which allows the compiler to assume that pure algorithms terminate: loops containing only invisible computations must exit somehow (that includes exiting with an exception).

With threads, that definition is obviously not usable: a loop in one thread isn't the whole program, and a stuck program is really one with no thread that can do something useful, and forbidding that would be sound.

But the very problematic standard definition of stuck doesn't describe a program execution but a single thread: a thread is stuck if it will perform no side effect that could potentially have an effect on observable side effects, that is:

1. no observable, obviously (no I/O)
2. no action that might interact with another thread

The standard definition of 2. is extremely broad and simplistic: all operations on an inter-thread communication device count, that is, any atomic operation and any action on any mutex. Full text for the requirement (relevant part in boldface):

[intro.progress]

The implementation may assume that any thread will eventually do one of the following:

terminate,

make a call to a library I/O function,

perform an access through a volatile glvalue, or

perform a synchronization operation or an atomic operation.

[ Note: This is intended to allow compiler transformations such as removal of empty loops, even when termination cannot be proven. — end note ]

That definition does not even specify:

an inter thread communication (from one thread to another)
a shared state (visible by multiple threads)
a modification of some state
an object that is not thread private

That means that all these silly operations count:

for fences:
- performing an acquire fence (even when followed by no atomic operation) in a thread that has at least once done an atomic store can synchronize with another fence or atomic operation
for mutexes:
- locking a locally recently created, patently useless function private mutex;
- locking a mutex to just unlock it doing nothing with the mutex locked;
for atomics:
- reading an atomic variable declared as const qualified (not a const reference to a non const atomic);
- reading an atomic, ignoring the value, even with relaxed memory ordering;
- setting a non const qualified atomic to its own immutable value (setting a variable to zero when nothing in the whole program sets it to a non zero value), even with relaxed ordering;
- doing operations on a local atomic variable not accessible by other threads;
for thread operations:
- creating a thread (that might do nothing) and joining it seems to create a (NOP) synchronization operation.

It means that no early, local transformation of program code that removes even the most silly and useless inter-thread primitive, and leaves no trace of the removal for later compiler phases, is unconditionally valid according to the standard: it might remove the last potentially useful (but actually useless) operation in a loop (a loop doesn't have to be spelled for or while; it's any looping construct, e.g. a backward goto).

This however doesn't apply if other operations on inter-thread primitives are left in the loop, or obviously if I/O is done.
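To illustrate the argument, here is a sketch of the kind of loop in question. Both function names are hypothetical; `spin_n` exists only so the effect on the value can be checked without running forever.

```cpp
#include <atomic>
#include <cassert>

// A loop whose only operation listed in [intro.progress] is a no-op relaxed RMW.
// As written, every iteration performs an atomic operation.
[[noreturn]] void spin_forever(std::atomic<int>& x) {
    for (;;)
        x.fetch_add(0, std::memory_order_relaxed);
}

// Bounded variant of the same loop body, for testing only.
// Returns the value of x afterwards, which the no-op RMW never changes.
int spin_n(std::atomic<int>& x, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        x.fetch_add(0, std::memory_order_relaxed);
    return x.load(std::memory_order_relaxed);
}
```

If an early pass deleted the `fetch_add(0, ...)` in `spin_forever` without leaving a trace, a later phase would see an empty infinite loop, which the quoted wording lets the implementation assume never happens. That is how removing an operation that "changes nothing" could still change what the program is allowed to do.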

This looks like a defect.

A meaningful requirement should be based:

not only on using thread primitives,
not be about any thread in isolation (as you can't see if a thread is contributing to anything, but at least a requirement to meaningfully interact with another thread and not use a private atomic or mutex would be better than the current requirement),
based on doing something useful (program observables) and inter-thread interactions that contribute to something being done.

I'm not proposing a replacement right now, as the rest of the thread specification isn't even clear to me.

I think the infinite-loop UB is to allow for cooperative multi-tasking implementations which could add a `yield` call before or after atomic and volatile accesses. When compiling for a target where pre-emptive multi-tasking is assumed, compilers don't have to treat volatile and atomic accesses as special in this respect. — Peter Cordes, Nov 15 '19 at 20:33
Also note that if a compiler doesn't remove empty infinite loops, it *could* safely remove mo_relaxed no-op RMWs like `+=0`. — Peter Cordes, Nov 15 '19 at 20:35
@PeterCordes "_the infinite-loop UB is to allow for cooperative multi-tasking implementations_" I don't think so (not primary reason). That "pure number crunching loops (NCL) terminate" axiom predates threads. It allows the reordering of possibly visible operations w/ NCL, which I already discussed a few times as that kind of reordering breaks progress bars, be it implemented w/ `printf("please wait...\n");` or a volatile watchdog or an atomic (w/ any memory order)... — curiousguy, Nov 15 '19 at 20:53
@PeterCordes "_allow for cooperative multi-tasking implementations_" Seriously, supporting "green threads", in the 2010s? Is that still a thing? Writing spec thinking about the strangest arch/strange impl choices does not even mean you will get a fully conforming impl for these corner cases! At the end you may end up with a needlessly complicated spec and non conforming impl on special arch. — curiousguy, Nov 15 '19 at 21:00
Is that a thing? I don't know, probably not in mainstream use. But C++ rules *do* make C++ implementations on systems like Classic MacOS possible (cooperative multi-tasking even between processes, not just threads), if you had a compiler that inserts yields on its own in cases they might be needed. Whether actual implementations work a certain way or not has zero bearing on the standard; they like to leave the door open for as much as possible. Not standardizing on 2's complement or arithmetic right shifts is another example of C++ being needlessly complicated for no apparent reason. — Peter Cordes, Nov 15 '19 at 21:24
