weird auto-vectorization in gcc with different results on godbolt

Question

I'm confused by an auto-vectorization result. The following code addtest.c

#include <stdio.h>
#include <stdlib.h>

#define ELEMS 1024

int
main()
{
  float data1[ELEMS], data2[ELEMS];
  for (int i = 0; i < ELEMS; i++) {
    data1[i] = drand48();
    data2[i] = drand48();
  }
  for (int i = 0; i < ELEMS; i++)
    data1[i] += data2[i];
  printf("%g\n", data1[ELEMS-1]); 
  return 0;
}

is compiled with gcc 11.1.0 by

gcc-11 -O3 -march=haswell -masm=intel -save-temps -o addtest addtest.c

and the add-to loop is auto-vectorized as

.L3:
    vmovaps ymm1, YMMWORD PTR [r12]
    vaddps  ymm0, ymm1, YMMWORD PTR [rax]
    add r12, 32
    add rax, 32
    vmovaps YMMWORD PTR -32[r12], ymm0
    cmp r12, r13
    jne .L3

This is clear: load from data1, load and add from data2, store to data1, and in between, advance the indices.
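For reference, one iteration of that loop corresponds roughly to the C below. This is only a sketch using the GCC/Clang vector extension, not something the compiler produced; `add_to` and `v8sf` are names made up for illustration, and `memcpy` stands in for the aligned `vmovaps` loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 8 floats = 32 bytes, the width of one ymm register (GCC/Clang vector extension). */
typedef float v8sf __attribute__((vector_size(32)));

/* Hypothetical helper: dst[i] += src[i] for n elements, n a multiple of 8. */
void add_to(float *dst, const float *src, size_t n)
{
    assert(n % 8 == 0);                /* no scalar tail loop in this sketch */
    float *end = dst + n;              /* plays the role of r13 */
    while (dst != end) {               /* cmp r12, r13 / jne .L3 */
        v8sf a, b;
        memcpy(&a, dst, sizeof a);     /* vmovaps ymm1, [r12]: load from data1 */
        memcpy(&b, src, sizeof b);     /* the [rax] memory operand: data2 */
        a += b;                        /* vaddps ymm0, ymm1, [rax] */
        memcpy(dst, &a, sizeof a);     /* vmovaps -32[r12], ymm0: store to data1 */
        dst += 8;                      /* add r12, 32 */
        src += 8;                      /* add rax, 32 */
    }
}
```

Calling `add_to(data1, data2, ELEMS)` would replace the second loop in addtest.c; the two pointers advancing in lockstep mirror the `r12`/`rax` increments in the listing above.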

If I pass the same code to https://godbolt.org, select x86-64 gcc-11.1 and options -O3 -march=haswell, I get the following assembly code:

.L3:
        vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
        vaddps  ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
        vmovaps YMMWORD PTR [rbp-8240], ymm1
        vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
        add     rax, 32
        cmp     rax, 4096
        jne     .L3

One surprising thing is the different address handling, but the thing that confuses me completely is the additional store to [rbp-8240]. This location is never used again, as far as I can see.

If I select gcc 7.5 on godbolt, the superfluous store disappears (but from 8.1 upwards, it is produced).

So my questions are:

Why is there a difference between my compiler and godbolt (different address handling, superfluous store)?
What does the superfluous store do?

Thanks a lot for your help!

Have you seen [this post](https://stackoverflow.com/questions/23696323/disable-auto-vectorization-of-specific-loops-in-a-function-in-gcc)? — ryyker, Apr 13 '22 at 13:59
If you're going to link code on Godbolt, use the "short" or "full" link options to actually link to your code with your chosen compiler version/options, so people don't have to copy/paste. — Peter Cordes, Apr 14 '22 at 01:57

score 1 · Accepted Answer · answered Apr 14 '22 at 02:17

The difference-maker is -fpie, which is on by default in most distros' GCC builds but not on Godbolt. It doesn't make a lot of sense that this should matter here, but compilers are complex pieces of machinery, not "smart".

It's not specific to -march=haswell or AVX either; the same difference happens with just -O3.

Godbolt configures GCC with simpler options than distros, e.g. without default-pie, and without -fstack-protector-strong. To match Godbolt locally, use at least -fno-pie -no-pie -fno-stack-protector. There might be others I'm forgetting about.
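One way to see which of these defaults a given GCC applies is to look at the macros it predefines. The sketch below assumes the macros documented for `-fpie`/`-fPIE` (`__PIE__`, defined to 1 or 2) and for the `-fstack-protector` family (`__SSP__`, `__SSP_ALL__`, `__SSP_STRONG__`, `__SSP_EXPLICIT__`); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* 0 = not compiled as PIE, 1 = -fpie, 2 = -fPIE (value of GCC's __PIE__ macro). */
int pie_level(void)
{
#ifdef __PIE__
    return __PIE__;
#else
    return 0;
#endif
}

/* Which stack-protector variant, if any, this translation unit was built with. */
const char *stack_protector_mode(void)
{
#if defined(__SSP_ALL__)
    return "all";
#elif defined(__SSP_STRONG__)
    return "strong";
#elif defined(__SSP_EXPLICIT__)
    return "explicit";
#elif defined(__SSP__)
    return "basic";
#else
    return "off";
#endif
}

/* Print both settings; they reflect the options in effect for this file. */
void report_codegen_defaults(void)
{
    int level = pie_level();
    assert(level >= 0 && level <= 2);  /* GCC documents only the values 1 and 2 */
    printf("PIE level: %d\n", level);
    printf("stack protector: %s\n", stack_protector_mode());
}
```

Calling `report_codegen_defaults()` from `main` in a build with plain `gcc -O3`, and again with `-fno-pie -fno-stack-protector` added, should show whether the local defaults differ from Godbolt's.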

IDK why this would trigger or avoid a missed-optimization, but I can confirm it does on my Arch GNU/Linux system with GCC 11.1.

Locally with gcc -O3 -march=haswell -fno-stack-protector -fno-pie
(and -masm=intel -S -o- vec.c | less) it matches Godbolt:

.L3:
        vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
        vaddps  ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
        vmovaps YMMWORD PTR [rbp-8240], ymm1
        vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
        add     rax, 32
        cmp     rax, 4096
        jne     .L3

But with my distro-configured GCC's defaults, plain -O3 -march=haswell gives:

.L3:
        vmovaps ymm1, YMMWORD PTR [r12]
        vaddps  ymm0, ymm1, YMMWORD PTR [rax]
        add     r12, 32
        add     rax, 32
        vmovaps YMMWORD PTR -32[r12], ymm0
        cmp     r12, r13
        jne     .L3

The same missed-opt happens without -march=haswell; we get a movaps XMMWORD PTR [rsp], xmm1 store to a fixed address inside the loop. (Since GCC doesn't need to over-align the stack to spill a 32-byte vector, it didn't use RBP as a frame pointer.)

For no apparent reason, using -fpie on the Godbolt compiler explorer gets GCC to use two pointer increments instead of indexed addressing modes, which also avoids the redundant store (producing the same asm you get locally). -fpie forces GCC to do that for arrays in static storage, because [arr + rax] would require the symbol address as a 32-bit absolute; see [32-bit absolute addresses no longer allowed in x86-64 Linux?](https://stackoverflow.com/q/43367427)
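To experiment with the static-storage case, a variant of the test with the arrays moved out of `main` might look like the sketch below. The function names are made up, the fill values are arbitrary, and no particular asm output is claimed here; the point is only to have something to compile with and without -fpie.

```c
#include <assert.h>
#include <stddef.h>

#define ELEMS 1024

/* Static storage: the arrays are addressed via symbols, not stack offsets. */
static float data1[ELEMS], data2[ELEMS];

/* Fill with small exact values so the result is easy to check. */
void fill_static(void)
{
    for (size_t i = 0; i < ELEMS; i++) {
        data1[i] = (float)i;
        data2[i] = 2.0f * (float)i;
    }
}

/* The same add-to loop as in the question, now on static arrays. */
void add_static(void)
{
    for (size_t i = 0; i < ELEMS; i++)
        data1[i] += data2[i];
}

/* Read back one element so the stores cannot be optimized away. */
float static_elem(size_t i)
{
    assert(i < ELEMS);
    return data1[i];
}
```

Compiling this with `-O3 -march=haswell -S`, once with `-fpie` and once with `-fno-pie`, and comparing the loop in `add_static` would show how the addressing differs for static arrays.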

You can and should report this on GCC's bugzilla with the keyword "missed-optimization".

Thanks a lot! Just a quick question: How can you find out which options are used by gcc on godbolt? `-v` doesn't show anything related to `pie` or `stack-protector`. — Ralf, Apr 14 '22 at 16:13
@Ralf: Those options are off by default on Godbolt, so that makes sense. If you check your local GCC, like `gcc -S foo.c -fverbose-asm` you might see a mention of those options. When I say "on by default", I don't mean anything is passing them, I mean GCC was configured at build-time with those options being on. e.g. [32-bit absolute addresses no longer allowed in x86-64 Linux?](https://stackoverflow.com/q/43367427) details the effects of `--enable-default-pie` at config time. — Peter Cordes, Apr 14 '22 at 20:39
@Ralf: As for how I knew about these, those are the two main options that make code-gen different locally from historical GCC defaults. Their effects are rather obvious in general: for `-fPIE`, different code when accessing static arrays, or appending `@plt` when making function calls; for `-fstack-protector-strong`, doing stuff with `fs:0x28` in functions containing local arrays. These are options which distros have recently (past several years) started enabling by default. If there are any other options that change code-gen, I'm not thinking of them at the moment. — Peter Cordes, Apr 14 '22 at 20:44
Thanks again! I can confirm that adding `-fno-pie -no-pie` reproduces the behavior of godbolt's compiler on my own machine. `-fno-stack-protector` doesn't seem to affect the auto-vectorization of the loop. — Ralf, Apr 15 '22 at 06:33
