4

Given the following test program:

#include <atomic>
#include <iostream>

int64_t process_one() {
        int64_t a;
        //Should be atomic on my haswell
        int64_t assign = 42;
        a = assign;
        return a;
}

int64_t process_two() {
        std::atomic<int64_t> a;
        int64_t assign = 42;
        a = assign;
        return a;
}

int main() {
        auto res_one = process_one();
        auto res_two = process_two();
        std::cout << res_one << std::endl;
        std::cout << res_two << std::endl;
}

Compiled with:

g++ --std=c++17 -O3 -march=native main.cpp

The code generated the following asm for the two functions:

00000000004007c0 <_Z11process_onev>:
  4007c0:       b8 2a 00 00 00          mov    $0x2a,%eax
  4007c5:       c3                      retq
  4007c6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4007cd:       00 00 00

00000000004007d0 <_Z11process_twov>:
  4007d0:       48 c7 44 24 f8 2a 00    movq   $0x2a,-0x8(%rsp)
  4007d7:       00 00
  4007d9:       0f ae f0                mfence
  4007dc:       48 8b 44 24 f8          mov    -0x8(%rsp),%rax
  4007e1:       c3                      retq
  4007e2:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4007e9:       00 00 00
  4007ec:       0f 1f 40 00             nopl   0x0(%rax)

Personally I don't speak much assembler, but (and I might be mistaken here) it seems that process_two compiled to include everything process_one does, and then some.

However, as far as I know, 'modern' x86-64 processors (e.g. Haswell, on which I compiled this) will do an aligned 8-byte assignment atomically without the need for any extra operations (in this case I believe the extra operation is the mfence instruction in process_two).

So why wouldn't gcc just optimize the code in process_two to behave exactly the same as process_one, given the flags I compiled with?

Are there still cases where an atomic store behaves differently from an assignment to a normal variable, given that they are both 8 bytes?

George
  • You are aware that `-Ofast` breaks floating-point math? – Henri Menke Jul 20 '17 at 09:09
  • I don't see any floating point operations here, @Henri, so I'm not sure how that is relevant. George, the concept of "atomicity" doesn't make any sense when applied to local, temporary variables in a single-threaded process. Could G++ have noticed that and transformed `std::atomic` in `process_two` to a no-op? Probably. But why should it bother making that optimization? If you don't need atomic semantics, then don't use a type that provides them. – Cody Gray - on strike Jul 20 '17 at 09:12
  • @Henri I'm only aware that -Ofast includes all -O3 optimizations. Indeed it seems that for the current code -O3 is exactly the same as -Ofast, and reading about -Ofast it seems it only enables some math-related flags, so I will change that line so as not to confuse people. – George Jul 20 '17 at 09:27
  • @Cody Gray My question is more about why atomic semantics are used at all here, when on my architecture those semantics aren't needed to make the operations I used (a load and a store) atomic on a 64-bit value – George Jul 20 '17 at 09:29
  • I don't understand your last comment. Are you agreeing with me that there is no point in using `std::atomic` in this code, and asking why the compiler doesn't optimize it away? Atomic semantics are used there because *you asked for them*. This doesn't have anything to do with x86, mind you. Atomicity has no meaning in this code on *any* processor. The reason why the compiler is generating code is because `std::atomic` also implies certain `std::memory_order` guarantees, and the `mfence` instruction is generated to enforce that. – Cody Gray - on strike Jul 20 '17 at 09:33
  • [clang seems to do the best job with this](https://godbolt.org/g/sraCU4), while [ICC uses XCHG](https://godbolt.org/g/5SL5Ge). – Paul R Jul 20 '17 at 09:33
  • @CodyGray, ok, let me put the question another way: is there a difference, in the context of that specific set of instructions, between std::memory_order_relaxed and std::memory_order_seq_cst, and if not, why do they generate different instructions? If there is a difference, what is it? – George Jul 20 '17 at 09:40
  • The difference is that a modern CPU doesn't execute machine code the same way you read it in assembly (order of instructions, or even register names: `eax` used twice in the source may be two different physical registers during execution). It shuffles the instructions around as much as it likes, to better use the available resources (various parts of the CPU capable of executing different tasks), while checking the constraints/dependencies "just enough" to produce an identical result in the end, as observed on a single thread. For multi-threaded code, fencing helps other threads observe the expected results. – Ped7g Jul 20 '17 at 09:49
  • std::memory_order_relaxed doesn't prevent the CPU from executing instructions in a different order from what is written in the assembly code. – Marek Vitek Jul 20 '17 at 09:54
  • @PaulR Not so fast with conclusions. Move that atomic variable outside the function scope, as it would be in a real MT program, and the situation changes. – Marek Vitek Jul 20 '17 at 11:28
  • @MarekVitek: you may be right, but I was just using the OP's code as it appears above and comparing how different compilers deal with it. – Paul R Jul 20 '17 at 11:55
  • @CodyGray - one reason it would be good to optimize such a pattern (e.g., by noting that the `atomic` variable is local and does not escape, and eliding the atomic stuff) is for generic code. Imagine a template function that sums values by creating a local T on the stack as the accumulator. The writer doesn't know the type of T, but if someone passes an atomic it would be nice to have the writes to the local be non-atomic (of course, the reads still need to be atomic). – BeeOnRope Jul 23 '17 at 23:29
  • Related: [Why don't compilers merge redundant std::atomic writes?](https://stackoverflow.com/questions/45960387/why-dont-compilers-merge-redundant-stdatomic-writes/45971285). This isn't a duplicate, because it's asking why compilers don't optimize away atomic locals that it can prove are not accessed from other threads concurrently. – Peter Cordes Sep 25 '17 at 00:23

2 Answers

10

The reason for it is that the default use of std::atomic also implies a memory order:

std::memory_order order = std::memory_order_seq_cst

To achieve this consistency, the compiler has to tell the processor not to reorder instructions, and it does so by emitting the mfence instruction.

Change your

    a = assign;

to

    a.store(assign, std::memory_order_relaxed);

and your output will change from

process_two():
        mov     QWORD PTR [rsp-8], 42
        mfence
        mov     rax, QWORD PTR [rsp-8]
        ret

to

process_two():
        mov     QWORD PTR [rsp-8], 42
        mov     rax, QWORD PTR [rsp-8]
        ret

Just as you expected it to be.

Marek Vitek
  • Correct. Minor nitpick: "tell processor to not reorder instructions" is not 100% correct (since `mfence` doesn't serialise). I can't come up with a short sentence describing all the nuances of memory ordering and visibility on x86. I think something like "tell the processor to respect the memory ordering and visibility" might do. :) – Margaret Bloom Jul 20 '17 at 10:33
  • @MargaretBloom Yes, you are right. It tells the CPU that instructions that come before the mfence will be executed before it, no matter in what order, and instructions that come after will be executed after it. In other words, stores before the mfence will be visible to loads after it. – Marek Vitek Jul 20 '17 at 10:57
  • Or maybe, in the words of the [documentation](http://x86.renejeschke.de/html/file_module_x86_id_170.html): "wait on following loads and stores until the preceding loads and stores are globally visible" – Marek Vitek Jul 20 '17 at 11:13
  • Without the mfence instruction, is it possible that two threads will see a different value of the variable a "at the same time"? Or does mfence only affect the order in which threads that reach that instruction execute the operation (e.g. if threads x and y reach the instruction, whoever reaches it first executes it first)? – George Jul 20 '17 at 12:15
  • mfence just ensures that the instructions around it are in the expected order. Let's say you have a variable you want to pass to another thread, and a guard variable. You write some data to your variable, and then you write to the guard variable that the data is available. In the other thread you check the guard and then read the data. Just imagine if the write to the guard were moved before the write to the data: then in the other thread you might think the data is in place while it is not. Maybe check this [SO answer](https://stackoverflow.com/a/44892029/8113019) for some more info; it might shed some more light on your question. – Marek Vitek Jul 20 '17 at 12:30
  • To be clear, and to emphasize again what Margaret said, mfence (and more generally other barrier and `lock`ed instructions) **do not** say that _instructions_ are prevented from reordering across the fence (indeed, plenty of operations that don't involve memory will be freely moved across it), nor even that memory operations are executed on the side they appear in the code. It only means what it is documented to mean in terms of in what order stores become visible to loads across all CPUs. Implementations are free to do whatever they want as long as they respect that. – BeeOnRope Jul 22 '17 at 16:22
  • It's not just theoretical - many memory operations are moved around even across barriers and `lock` instructions, but in a way that the CPU knows won't violate ordering. For example a read of a cache line in an exclusive state can easily be moved in either direction across a barrier since that won't change memory ordering (and if the line loses exclusivity, there will be a memory order violation and a CPU clear to restore order). It's just very hard to talk about instruction ordering and "executing stuff" without often being wrong - better to talk about memory model guarantees only. – BeeOnRope Jul 22 '17 at 16:26
  • They are prevented from reordering in a way, so the statement "every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible" is true. Only CPU engineers know what goes on inside, but we are interested in what is visible to a program running on that CPU. I might have used the wrong words, but it was just to explain why the two codes differ. The instruction is there so writes and reads appear in the right order with respect to mfence at the asm level. No need to go deeper. – Marek Vitek Jul 22 '17 at 17:33
  • Right, the requirement is about _global visibility_, not reordering. Any reordering that respects that is fine (and many are done in practice). Really, the `mfence` is not even needed here, because the variable doesn't escape the function and won't participate in concurrency semantics, so no fences are needed at the assembly level: all the C++ semantics are preserved just by compiling both functions the same. – BeeOnRope Jul 23 '17 at 23:32
  • @MargaretBloom maybe nitpicking, but the Intel manual (vol. 2B, 4-22) states that `MFENCE` _Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction_ – LWimsey Jul 25 '17 at 12:58
  • @LWimsey Yes, I implicitly used the word "serialise" to mean "serialise all instruction". This is how Intel use it unless otherwise specified (as in the example you provided). My comment was also contextual in the original wording chosen by the OP that talked about the reordering of instructions in general. You see, it's easy to stumble upon something in this context :) – Margaret Bloom Jul 25 '17 at 13:44
  • @MargaretBloom Thank you, that makes sense – LWimsey Jul 25 '17 at 13:54
2

It's just a missed optimization. For example, clang does just fine with it - both functions compile identically, to a single mov eax, 42.

Now, you'd have to dig into the gcc internals to be sure, but it seems that gcc has not yet implemented many common and legal optimizations around atomic variables, including merging consecutive reads and writes. In fact, none of clang, icc, or gcc seems to optimize much of anything yet, except that clang handles local atomics (including those passed by value) by essentially removing their atomic nature, which is useful in some cases such as generic code. Sometimes icc generates especially bad code - see two_reads here, for example: it seems to only ever want to use rax both as the address and as the accumulator, resulting in a stream of mov instructions shuffling things around.

Some more complex issues around atomic optimization are discussed here and I expect compilers will get better at this over time.

BeeOnRope
  • Migrating the comment responses here, I agree that the template example is a good case where this optimization would be handy (and the standard go-to case for C++ code). I'm certainly not going to disagree that compilers should implement optimizations. :-) I think there are two basic reasons that atomic operations aren't optimized. One is the obvious: complexity, and the desire to avoid bugs. The second is just that `std::atomic` is pretty new on the scene, as far as things go, and compiler vendors haven't had much time to optimize around it. – Cody Gray - on strike Jul 24 '17 at 11:06
  • Some compilers just recently added support (looking at you, MSVC). I was just trying to figure out exactly what the question being asked here was. I interpreted it much the same way as you: why isn't the compiler optimizing this? And the answer is, of course, a simple missed optimization opportunity. But it seems that Marek interpreted the question in a slightly different way, and therefore took his answer in a different direction. Both are correct, of course, but I was hesitant to post an answer of my own until I could narrow down better what was being asked. Glad to see this posted too. – Cody Gray - on strike Jul 24 '17 at 11:08
  • Yes, I think they fail mostly because of the newness of the formal memory model and the `std::atomic` stuff, although perhaps I wasn't clear enough in my answer. Note that you don't need to invoke templates to make the "local atomic" optimization make sense: simply imagine any `std::atomic` member of a class: when that class is used as a local, you'd like not to pay the atomic cost, and this kind of "escape analysis" can do it. FWIW I found that clang's optimizations are actually quite limited - they get the assignment above, but if you change it to `+=` they emit atomic instructions. – BeeOnRope Jul 24 '17 at 19:32
  • @CodyGray - yup, if I really tried to the read the mind of the OP, I would say that he probably wanted to ask the question that Marek answered, which is something like "Why does simple assignment need fences/`lock` prefix when it is already atomic in the hardware" - since he mentions that. I'm answering the question posed by his code, which is "Why doesn't gcc optimize atomics which are provably restricted to one thread" which perhaps also useful to someone :) – BeeOnRope Jul 24 '17 at 19:35
  • This is an assignment with `memory_order_seq_cst`. It must make sure that all preceding stores become visible to other threads and subsequent loads do not get reordered before this store. Hence `mfence`. The compiler is doing the right thing here. – Maxim Egorushkin Sep 25 '17 at 14:50
  • @Maxim - well, store-store reordering can't happen with regular stores on x86, so no fence is needed to prevent that. It does _in general_ need to prevent store-load reordering (the only type that happens on x86), and gcc likes to use `mfence` for this, but it isn't necessary here, since the variables are local and so the compiler can see that there are no visible reorderings involving `a`. Unless you think clang is getting it wrong? It doesn't use fences at all. – BeeOnRope Sep 25 '17 at 17:38