
I wonder: does the value of an argument of type std::memory_order merely tell the compiler how it may reorder the code, or does the value also affect the selection of the instructions used to operate on atomic objects?

As stated at https://en.cppreference.com/w/cpp/atomic/memory_order, for example:

memory_order_acquire: no reads or writes in the current thread can be reordered before this load.

This imposes a requirement on the compiler regarding how it may reorder the code. Assume the target platform has two atomic store instructions: `W0 addr, eax` and `W1 addr, eax`. IIUC, the value of memory_order also affects the selection of which instruction will be used to operate on an atomic object, right?
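For instance (my own illustration; the `W0`/`W1` instructions above are hypothetical, so this uses the real x86-64 mapping instead), a snippet like this shows the ordering argument selecting different store instructions, which can be checked on Compiler Explorer:

```cpp
#include <atomic>

std::atomic<int> v{0};

// On x86-64 this typically compiles to a plain `mov`.
void store_relaxed() { v.store(1, std::memory_order_relaxed); }

// On x86-64 this typically compiles to `xchg` (or `mov` + `mfence`).
void store_seq_cst() { v.store(2, std::memory_order_seq_cst); }
```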


Another question: if the value of the memory_order argument can only be determined at runtime, how does the compiler know how to reorder the code according to that value?

xmh0511
  • The standard doesn't care (as such, about your hardware) - it just sets rules that must be followed. Whether it's the hardware the code is being generated for that makes sure those rules are followed or if it's the compiler making sure (by inserting barriers or reordering instructions or whatever) is irrelevant. – Jesper Juhl Aug 15 '23 at 08:46
  • Yes, there might be different instructions. Did you try checking the disassembly? Regarding order only known at runtime - I bet the compiler would just default to the strongest order. – HolyBlackCat Aug 15 '23 at 08:48
  • @HolyBlackCat I'm not familiar with assembling, so I didn't try to look at the resulting asm. Do you agree with my understanding in the OP? – xmh0511 Aug 15 '23 at 08:58
  • Yes, compile time ordering is ensured, and if the hardware memory-ordering guarantees aren't as strong as this memory order then different or additional instructions will get emitted. (e.g. on x86-64, only `seq_cst` stores need anything different from what you'd expect; RMWs need the `lock` prefix for atomicity, and that's already a full barrier so `seq_cst`.) https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html . To look at the asm, see [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) – Peter Cordes Aug 15 '23 at 08:59
  • Sounds correct to me. You don't need to understand asm to observe that the instructions are different. – HolyBlackCat Aug 15 '23 at 08:59
  • Re: runtime-variable `memory_order` - GCC (and clang IIRC) just treat it as `seq_cst` instead of branching to avoid expensive barriers. This sometimes gives more compact but slower code, which is fine for debug builds which is the main place you find runtime-variable mem_order. – Peter Cordes Aug 15 '23 at 09:00
  • @PeterCordes So, for an argument of type `std::memory_order`, the value does affect both compiler reordering and the actual instructions used to operate on atomic objects, right? – xmh0511 Aug 15 '23 at 09:13
  • @xmh0511: Yes, of course. The generated asm has to implement the semantics defined in the C++ standard. This means using the right instructions in the right order. On some ISAs (notably x86) this doesn't require any special instructions, just ordering, for acq/rel or relaxed load and store. – Peter Cordes Aug 15 '23 at 09:23
  • @PeterCordes By observing the asm in https://godbolt.org/, the writing to an atomic object(with `relaxed` ordering) and the writing to a non-atomic scalar object correspond to the same instruction, namely `mov dword ptr [rip + v], 1`, why does the standard [intro.races#21](https://eel.is/c++draft/intro.races#21) say the latter would result in UB if both of them are executed in multiple threads? [intro.races] p13 says non-scalar objects can only see visible side effects, however, atomic objects can see the side effect if the effect does not happen before the reading. – xmh0511 Aug 15 '23 at 09:43
  • Remember that architectures are (*very*) different. Even if one architecture uses the same instructions for sequential consistency and relaxed ordering, that doesn't mean that the same is true for other architectures. The C++ standard just sets the rules that must be followed - how those rules are implemented on x86/x86_64/ia64/arm/mips/alpha/m68000/Risc-V/Sparc/Power/whatever, may be very different. Something may be defined as UB because there are architectures out there where it's impossible to implement it or where implementing it might come with a great performance cost. – Jesper Juhl Aug 15 '23 at 10:16
  • @xmh0511: UB is a phenomenon in the C++ abstract machine, not asm after compiling for a particular target. The data-race UB rule is what allows compilers to optimize away stores and reloads of non-`atomic` variables. No mainstream CPUs have hardware race detection, so they don't need special instructions for relaxed atomic load/store. [MCU programming - C++ O2 optimization breaks while loop](https://electronics.stackexchange.com/q/387181) – Peter Cordes Aug 15 '23 at 15:46
  • @JesperJuhl: Not *just* because of hypothetical ISAs with hardware race detection. Also because data-race UB allows compilers to optimize single-threaded code (not using `atomic`) aggressively, like they did before C++11, e.g. keeping variables in registers. See my last comment and [Multithreading program stuck in optimized mode but runs normally in -O0](//stackoverflow.com/q/58516052). A better example of what you're talking about, some hardware having weird behaviour so C making it UB: [Compiler optimizations may cause integer overflow. Is that okay?](//stackoverflow.com/q/74102654) – Peter Cordes Aug 15 '23 at 15:49
  • @PeterCordes So, from the perspective of a specific platform, the same instructions for writing to atomic and non-atomic objects can imply they will behave the same on that platform when running in multiple threads; however, from the perspective of the C++ standard, the non-atomic case is UB? – xmh0511 Aug 17 '23 at 03:59
  • @xmh0511: Right. Plain load/store are atomic "for free" in asm for narrow enough naturally-aligned types on most ISAs ([including x86](https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic-on-x86/36685056#36685056)), but the optimizer needs to know that other threads might change the value (e.g. so a load after an assignment might get a different value back), and whether ordering wrt. other ops matters. Those are hugely important things that let compilers optimize non-atomic variables into registers, etc. – Peter Cordes Aug 17 '23 at 04:07
  • @PeterCordes All right. That means when the compiler sees a variable defined with an `atomic` type, it won't optimize the variable into a register whatever the memory order is; whereas when it sees a variable defined with a non-atomic type, it may put it into a register for optimization. Anyway, if we see the compiler generate the same asm instructions for atomic and non-atomic objects, will they behave the same on that platform regardless of whether they are atomic in the C++ sense? – xmh0511 Aug 21 '23 at 02:23
  • Right. You can think of it as: if the C++ compiler decides to actually do a load or store at all, then when/wherever that operation actually happens in asm, it will be atomic at the asm level for some types (no tearing, but ordering can be as weak as `relaxed`, or the compiler might load the value multiple times when the source loads it once). Notice how many ifs and whens (compile time reordering / hoisting / sinking) went into that: even if the building blocks the compiler used happened to be atomic loads/stores, it's not even trying to put them together in a way that respects other threads. – Peter Cordes Aug 21 '23 at 03:52
  • @PeterCordes *if the C++ compiler decides to actually do a load or store at all*, Do `load` and `store` here refer to general value reading and value writing instead of the `load` and `store` operation defined for atomic objects? – xmh0511 Aug 22 '23 at 01:55
  • @xmh0511: I'm referring to asm instructions that read or write memory. That's the domain the compiler is creating a program for, whose *observable* behaviour (in UB-free programs) has to match the C++ abstract machine. Optimization doesn't work by changing the C++ source, it works by taking the program logic (and the C++ rules that define what that logic means and which parts are observable behaviour that another thread might see without undefined behaviour) and transforming an internal representation of that. So no, absolutely not introducing anything like `std::atomic::load()` calls! – Peter Cordes Aug 22 '23 at 02:01

1 Answer


Yes, both.

The C++ memory model requires that atomic operations follow certain semantics, which depend on the specified memory ordering parameter. So the compiler has to emit code which, when executed, behaves according to those semantics.

For example, taking code like:

std::atomic<int> x;
int y, tmp;
if (x.load(std::memory_order_acquire) == 5) {
    tmp = y;
}

On a typical machine, the compiler would need to:

  1. Not reorder the loads of x and y at compile time. In other words, it should emit a load instruction of x and a load instruction of y, such that the first is executed before the second in program order.

  2. Ensure that the loads of x and y become visible in that order. If the machine is capable of out-of-order execution, speculative loads, or any other feature that could cause two loads to become visible out of program order, then the compiler must emit code that prevents it from happening in this instance.

    What that code looks like, depends on the machine in question. Possibilities include:

    • Nothing special is needed, because the machine doesn't do this particular kind of reordering. So x and y will just be loaded by ordinary load instructions, with nothing extra. This is the case on x86, for instance, where "all loads are acquire".

    • Using a special form of the load instruction which inhibits reordering. For instance, on AArch64, the load of x would be done with the ldapr or ldar instruction instead of the ordinary ldr.

    • Inserting a special memory barrier instruction between the two loads, like ARM's dmb.


In the vast majority of code, the memory ordering parameter is specified as a compile-time constant, because the programmer knows statically what ordering is required, and so the compiler can emit the instructions appropriate to that particular ordering.
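To illustrate the common case (a sketch of mine, not tied to the example above): a release store paired with an acquire load, both with constant orderings, so the compiler knows exactly which instructions to pick for each side:

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> ready{false};
int payload = 0; // non-atomic data, published via release/acquire

void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release); // earlier writes may not sink below this store
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // later reads may not hoist above this load
    return payload; // guaranteed to observe 42
}
```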

In the unusual case where the ordering parameter is not a constant, then the compiler has to emit code that will behave properly no matter what value is specified. Usually what's done is that the compiler just treats the ordering parameter as being memory_order_seq_cst, since that is stronger than all the others: a seq_cst operation satisfies all the semantics required by the weaker orderings (and more besides). This saves the cost of actually testing the value of the ordering parameter at runtime and branching accordingly, which likely outweighs the potential savings of doing the operation with a weaker ordering.
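In other words, a wrapper taking a runtime ordering parameter might effectively be compiled as if it were written like this (a simplification of mine; real compilers apply this on their internal representation, not the source):

```cpp
#include <atomic>

int load_with(const std::atomic<int>& x, std::memory_order order) {
    (void)order; // not branched on: seq_cst satisfies the semantics of every weaker ordering
    return x.load(std::memory_order_seq_cst);
}
```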

But if the compiler did choose to test and branch, it would typically have to assume "worst case" for the purposes of optimizing surrounding code. For instance, on AArch64, for x.load(order) it might emit a chunk of code like the following:

int t;
if (order == std::memory_order_relaxed)
    LDR t, [x]
else if (order == std::memory_order_acquire)
    LDAPR t, [x]
else if (order == std::memory_order_seq_cst)
    LDAR t, [x]
else
    abort();
if (t == 5)
    LDR tmp, [y]

However, it would need to ensure that the load of y remained at the end of this chunk of code (in program order). If order were equal to std::memory_order_relaxed, then it would be okay to execute the load of y before the load of x, but not if it were std::memory_order_acquire or stronger.

On the other hand, it could conceivably emit

int t, t2;
if (order == std::memory_order_relaxed) {
    LDR t2, [y]
    LDR t, [x]
} else if (order == std::memory_order_acquire) {
    LDAPR t, [x]
    LDR t2, [y]
} else if (order == std::memory_order_seq_cst) {
    LDAR t, [x]
    LDR t2, [y]
} else
    abort();
if (t == 5)
    tmp = t2;

but we are now well outside the range of transformations that a real-world compiler would actually perform.

Nate Eldredge
  • On an ISA with simpler but more expensive barriers (like 32-bit ARM), it's plausible that a non-inline function with a memory order arg might compile to a load and then a conditional branch over a separate barrier instruction if it's not `acquire`, `acq_rel`, or `seq_cst`. (Or for an x86 store, to a branch over an `mfence` or dummy locked operation if not seq_cst, to optimize for the probably-common case of `release`). But yeah, current compilers just strengthen to `seq_cst` when a memory order isn't known at compile-time. – Peter Cordes Aug 18 '23 at 23:13