gcc optimization better at -O0 than -O3

Question

I recently wrote some vector code and a corresponding godbolt example.

typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

v8f f(v8f x)
{
  return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
}

f:
        vmovaps ymm1, ymm0
        vxorps  xmm0, xmm0, xmm0
        vperm2f128      ymm0, ymm1, ymm0, 33
        vpalignr        ymm0, ymm0, ymm1, 4
        ret
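
For reference, the shuffle mask {1, ..., 8} indexes into the 16-lane concatenation of x and the zero vector, so the result holds lanes 1..7 of x followed by a zero: every element moves down one lane and the top lane is zero-filled. Below is a minimal sketch of that reading, checked against a scalar loop. The helper names shift_down_vec, shift_down_ref and check_shift are illustrative and not part of the original code, and plain arrays are used at the function boundary only so that the sketch also builds without -mavx.

```c
#include <assert.h>
#include <string.h>

typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

/* Hypothetical wrapper: the same shuffle as f(), but operating on arrays
   so that no 32-byte vector crosses a function-call boundary. */
static void shift_down_vec(const float in[8], float out[8])
{
  v8f x;
  memcpy(&x, in, sizeof x);
  /* Indices 0..7 select from x, indices 8..15 select from the zero vector. */
  v8f r = __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
  memcpy(out, &r, sizeof r);
}

/* Scalar reference: move each element down one lane, zero-fill the top. */
static void shift_down_ref(const float in[8], float out[8])
{
  for (int i = 0; i < 7; ++i)
    out[i] = in[i + 1];
  out[7] = 0.0f;
}

/* Self-check: both versions must agree on a sample input. */
static void check_shift(void)
{
  const float in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  float a[8], b[8];
  shift_down_vec(in, a);
  shift_down_ref(in, b);
  assert(memcmp(a, b, sizeof a) == 0);
}
```

Calling check_shift() from main is enough to exercise it. This only confirms what the function computes; it says nothing about which instructions gcc picks at a given optimization level.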

I wanted to see how the different optimization settings (-O0/-O1/-O2/-O3) affected the code, and all but -O0 gave identical code. -O0 gave the predictable frame-pointer garbage and also copied the argument x to a stack-local variable for no good reason. To fix this, I added the register storage class specifier:

typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

v8f f(register v8f x)
{
  return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
}

At -O1/-O2/-O3, the generated code is unchanged from before, but at -O0:

f:
        vxorps  xmm1, xmm1, xmm1
        vperm2f128      ymm1, ymm0, ymm1, 33
        vpalignr        ymm0, ymm1, ymm0, 4
        ret

At -O0, gcc figured out how to avoid the redundant register copy (the vmovaps ymm1, ymm0 that the optimized builds emit). While such a copy might be move-eliminated by the CPU, it still increases code size for no benefit (so -Os output is bigger than -O0?).

How/why does gcc generate better code for this at -O0 than -O3?

Of course in real life you'd be inlining this function. Do you still get redundant moves if the function is inlined into a more realistic context? — Nate Eldredge, May 23 '20 at 17:12
Looks to be similar to gcc bug [Sub-optimal YMM register allocation.](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91796) — Maxim Egorushkin, May 23 '20 at 17:15
@NateEldredge Good point! [godbolt](https://godbolt.org/z/UyxMyt) says that even inlined, the redundant `vmovaps ymm2, ymm0 vxorps xmm0, xmm0, xmm0` remains. Now that I've seen it, it *really* bugs me. WTH is `gcc` copying a register only to zero the source immediately? — EOF, May 23 '20 at 17:17
@MaximEgorushkin Looks very similar, but 1) my repro may be smaller (in terms of assembly generated) and 2) in my repro `-O0` generates perfect code. — EOF, May 23 '20 at 17:19
In my example `register` doesn't help, unfortunately: https://gcc.godbolt.org/z/j7ns57 — Maxim Egorushkin, May 23 '20 at 17:25
@MaximEgorushkin Yeah, it's not so much that `register` helps as that `-O0` (and [partially](https://godbolt.org/z/S8RMkB) `-O1`, no `vmovaps` here, but also no `vfnmadd231ps`, so a wash) seem to avoid the redundant register copies. Unfortunately, they obviously don't produce anywhere near as good code otherwise, and `register` can't compensate completely (also, being deprecated in C++?). — EOF, May 23 '20 at 17:32
@EOF: your "inlining" test is still inlining into a relatively tiny function; as [commented on Maxim's GCC bug](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91796#c7), these wasted MOV instructions are more common in tiny functions due to hard constraints from calling convention boundaries. They do sometimes happen for real after inlining into a loop or something non-trivial, but in my experience usually only when you want the 128-bit low and high halves of a vector and GCC decides to zero-extend the low half to 256 with an XMM `vmovdqa` for no reason. — Peter Cordes, May 23 '20 at 18:39
@PeterCordes Well, maybe. Of course you also wouldn't care so much about this if it only happens at the start of a function if the (inlined) function does a lot more work, proportionally. But a [variant](https://godbolt.org/z/F8nZr6) of the link I gave for borderline better `-O1`-optimization to MaximEgorushkin shows the redundant `vmovaps` *not* as the first instruction of the function, so I'm not convinced this is as rare as claimed. (Also, seems the `-O1`-code has a regression in `gcc 10.1` on godbolt.) — EOF, May 23 '20 at 18:47
