
Consider the following declaration of local variables:

bool a{false};
bool b{false};
bool c{false};
bool d{false};
bool e{false};
bool f{false};
bool g{false};
bool h{false};

On x86-64, I'd expect the optimizer to reduce the initialization of those variables to something like mov qword ptr [rsp], 0. But instead, what I get from every compiler I've been able to try (regardless of optimization level) is some form of:

mov     byte ptr [rsp + 7], 0
mov     byte ptr [rsp + 6], 0
mov     byte ptr [rsp + 5], 0
mov     byte ptr [rsp + 4], 0
mov     byte ptr [rsp + 3], 0
mov     byte ptr [rsp + 2], 0
mov     byte ptr [rsp + 1], 0
mov     byte ptr [rsp], 0

This seems like a waste of CPU cycles. Using copy-initialization, value-initialization, or replacing the braces with parentheses made no difference.
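
For reference, the initialization forms referred to above look roughly like this (shown with different names here so the snippet compiles on its own; in the actual test the same form was used for all eight variables):

bool a = false;   // copy-initialization
bool b{};         // value-initialization
bool c(false);    // direct-initialization with parentheses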

But wait, that's not all. Suppose that I have this instead:

struct
{
    bool a{false};
    bool b{false};
    bool c{false};
    bool d{false};
    bool e{false};
    bool f{false};
    bool g{false};
    bool h{false};
} bools;

Then the initialization of bools generates exactly what I'd expect: mov qword ptr [rsp], 0. What gives?

You can try the code above yourself in this Compiler Explorer link.
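
The exact contents of the link are not reproduced here, but a minimal harness along the following lines shows the same code generation. The function foo taking its parameters by reference is an assumption (suggested by the comments below); the references force each bool to have a stack address:

void foo(bool&, bool&, bool&, bool&,
         bool&, bool&, bool&, bool&);   // hypothetical consumer, defined elsewhere

void test()
{
    bool a{false};
    bool b{false};
    bool c{false};
    bool d{false};
    bool e{false};
    bool f{false};
    bool g{false};
    bool h{false};
    foo(a, b, c, d, e, f, g, h);        // taking addresses keeps the variables in memory
}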

The behavior of the different compilers is so consistent that I am forced to think there is some reason for the above inefficiency, but I have not been able to find it. Do you know why?

Alex Lop.
TheProgammerd
  • irrelevant for C. This initialisation is for C++, not C – Alex Lop. Aug 11 '20 at 09:25
  • perhaps alignment issue? The struct bools may have a different alignment guarantee than a single bool. – Andreas H. Aug 11 '20 at 09:33
  • If you were to pull 10^6 lines of C++ from a repo, how often would this 'pattern' occur? I suspect the answer is 'not often', and that the compiler writers (yes, all of them) have spent their optimisation efforts more productively on more frequently occurring patterns. If you like, this is an economic decision, not a purely technical one. But this is just conjecture. – High Performance Mark Aug 11 '20 at 09:33
  • @AlexLop Like I said in my question, the brace initialization is irrelevant. I can replace that by C-friendly `bool a = false` or `bool a(false)` or `bool a()`. The result is the same – TheProgammerd Aug 11 '20 at 09:39
  • @AndreasH. The stack should be 16-byte aligned, but you can place a `void* X;` variable before the booleans (to make sure they are 8-byte aligned) and it does not change the generated code one bit – TheProgammerd Aug 11 '20 at 09:40
  • @HighPerformanceMark That's a fair argument, but then why apply the optimization to the `bools` local struct variable and not to the local booleans? I don't think one pattern is more likely than the other – TheProgammerd Aug 11 '20 at 09:42
  • The answer is presumably just "oversight": because your `foo` takes references to its parameters, the compiler has to make sure each variable has an address, which is ever so slightly different from the struct case, where this optimization is more common. You could raise a bug with GCC and Clang to see what they have to say about it, but probably they just never needed to. – Botje Aug 11 '20 at 09:48
  • It's more interesting with `char`s for gcc: https://gcc.godbolt.org/z/3rEPEo – Waqar Aug 11 '20 at 09:51
  • @TheProgammerd Ah, I did not see the "local" variables in your question and was assuming a global scope. There, of course, the placement of variables is the task of the linker, which might reorder/combine/omit variables. But not in your case, of course. – Andreas H. Aug 11 '20 at 13:14

1 Answer


Compilers are dumb; this is a missed optimization. mov qword ptr [rsp], 0 would be optimal. Store forwarding from a qword store to a byte reload of any individual byte is efficient on modern CPUs. (https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/)

(Or even better, push 0 instead of sub rsp, 8 + mov; that's also a missed optimization, because compilers don't bother looking for cases where it's possible.)


Presumably the optimization pass that looks for store merging runs before nailing down the locations of locals in the stack frame relative to each other. (Or before even deciding which locals can be kept in registers and which need memory addresses at all.)

Store merging (aka coalescing) was only recently reintroduced in GCC8, IIRC, after being dropped as a regression from GCC2.95 to GCC3 (again, IIRC). (I think other optimizations, like assuming no strict-aliasing violations to keep more vars in registers more of the time, were more useful.) So it had been missing for decades.

From one POV, you could say you should consider yourself lucky you're getting any store merging at all (with struct members, and array elements, that are known early to be adjacent). Of course, from another POV, compilers ideally should emit good asm. But in practice missed optimizations are common. Fortunately we have beefy CPUs with wide superscalar out-of-order execution that can usually chew through this crap and still get to upcoming cache-miss loads and stores fairly quickly, so the wasted instructions sometimes have time to execute in the shadow of other bottlenecks. That's not always true, though, and clogging up space in the out-of-order execution window is never a good thing.
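
As a practical illustration of that point (a sketch, not something the answer prescribes): if you tell the compiler early on that the flags are adjacent, by grouping them in a struct or an array, you typically do get the single qword store today. consume is a hypothetical helper, only there to keep the stores from being optimized away:

struct Flags { bool a, b, c, d, e, f, g, h; };   // members are known to be adjacent

void consume(Flags&);                            // hypothetical consumer, defined elsewhere

void init()
{
    Flags flags{};        // typically compiles to a single  mov qword ptr [rsp], 0
    consume(flags);
}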

Related: "In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?" covers the general case for constants other than 0, re: what the optimal asm would be. (The difference between an array vs. separate locals was only discussed in comments there.)

Peter Cordes
  • How about `and qword [address], 0` here? I am not sure whether it applies to qword stores in 64-bit mode. Under 8086 compatibility I prefer `and word [address], 0` to `mov word [address], 0` because the `and` optimises to a sign-extended 8-bit immediate. Thus the `and` takes up one byte less. The disadvantage is presumably that the RMW of `and` is slower than the store-only `mov`. – ecm Aug 15 '20 at 16:34
  • `mov qword [rsi], 0` results in 48C70600000000 (7 bytes) whereas `and qword [rsi], 0` is 48832600 (4 bytes). – ecm Aug 15 '20 at 16:39
  • @ecm: I was only considering optimizing for speed, like `-O3` code-gen, not `-Os` or `clang -Oz` code-gen. At `-Os`, you'd consider something like `xor eax,eax` / `mov [rdi], rax` (2 + 3 = 5 bytes): still no false dependency on the old contents, but one extra front-end uop. (xor-zeroing is as cheap as a NOP on Intel CPUs.) At `-Oz` you'd go full code-golf and yes, `and qword [rsi], 0` is the smallest. Of course, `push 0` is only 2 bytes, so a smarter compiler that can initialize + reserve space with `push` gets to win big, also saving a dummy push or `sub rsp,8`. – Peter Cordes Aug 15 '20 at 16:51