
I have a trivial loop for which I am expecting to see YMM registers in the assembly, but I am only seeing XMM:

program loopunroll
integer i
double precision x(8)
do i=1,8
   x(i) = dble(i) + 5.0d0
enddo
end program loopunroll

Then I compile it (gcc or gfortran does not matter; I am using gcc 8.1.0):

[user@machine avx]$ gfortran -S -mavx loopunroll.f90
[user@machine avx]$ cat loopunroll.f90|grep mm
[user@machine avx]$ cat loopunroll.s|grep mm
    vcvtsi2sd       -4(%rbp), %xmm0, %xmm0
    vmovsd  .LC0(%rip), %xmm1
    vaddsd  %xmm1, %xmm0, %xmm0
    vmovsd  %xmm0, -80(%rbp,%rax,8)

But if I do this with Intel Parallel Studio 2018 Update 3:

[user@machine avx]$ ifort -S -mavx loopunroll.f90
[user@machine avx]$ cat loopunroll.s|grep mm
    vmovdqu   .L_2il0floatpacket.0(%rip), %xmm2             #11.8
    vpaddd    .L_2il0floatpacket.2(%rip), %xmm2, %xmm3      #11.15
    vmovupd   .L_2il0floatpacket.1(%rip), %ymm4             #11.23
    vcvtdq2pd %xmm2, %ymm0                                  #11.15
    vcvtdq2pd %xmm3, %ymm5                                  #11.15
    vaddpd    %ymm0, %ymm4, %ymm1                           #11.8
    vaddpd    %ymm5, %ymm4, %ymm6                           #11.8
    vmovupd   %ymm1, loopunroll_$X.0.1(%rip)                #11.8
    vmovupd   %ymm6, 32+loopunroll_$X.0.1(%rip)             #11.8

I have also tried the flags -march=core-avx2 -mtune=core-avx2 with both GNU and Intel, and I still get the same result: XMM in the GNU-produced assembly, but YMM in the Intel-produced assembly.

What should I be doing differently please folks?

Many thanks, M

  • You forgot to enable optimization with `gfortran`. Use `gfortran -O3 -march=native`. (For gcc, `-ftree-vectorize` is only enabled at `-O3`, not `-O2`. And the default is `-O0`, i.e. compile fast and make terribly slow code that gives consistent debugging.) – Peter Cordes Jul 02 '18 at 11:54
  • And BTW, everyone please vote to make the [tag:xmm] tag a synonym of [tag:sse]. https://stackoverflow.com/tags/sse/synonyms. – Peter Cordes Jul 02 '18 at 11:58
  • Hi Peter, thanks for the reply, but I tried that and it also fails. In fact, when I use -O3 it throws away any reference to both XMM and YMM in the assembly. – Morph Jul 02 '18 at 11:59
  • So write a function that doesn't optimize away completely! Or use inline asm with a `"memory"` clobber to stop the optimizer, if that's possible in GNU fortran. See [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) for tips on writing useful functions for looking at compiler asm output. – Peter Cordes Jul 02 '18 at 12:01

2 Answers


You forgot to enable optimization with gfortran. Use gfortran -O3 -march=native.

For that to not optimize away entirely, write a function (subroutine) that produces a result that code outside that subroutine can see. e.g. take x as an argument and store it. The compiler will have to emit asm that works for any caller, including one that cares about the contents of the array after calling the subroutine on it.
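
A minimal Fortran sketch of that idea (the name fill_x is just a placeholder; the runtime size n is what keeps the result from being a compile-time constant):

! Sketch only: a subroutine the compiler must compile to work for any caller.
subroutine fill_x(x, n)
  integer, intent(in) :: n
  integer :: i
  double precision :: x(n)    ! explicit-shape dummy array: the caller owns the storage
  do i = 1, n
     x(i) = dble(i) + 5.0d0   ! the stores are visible to the caller, so they can't be dropped
  enddo
end subroutine fill_x

Because n isn't known at compile time, the loop can't be replaced by a block of constant stores, and gfortran -O3 -march=haswell -S should then vectorize it with YMM registers.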


For gcc, -ftree-vectorize is only enabled at -O3, not -O2.

The gcc default is -O0, i.e. compile fast and make terribly slow code that gives consistent debugging.

gcc will never auto-vectorize at -O0. You must use -O3 or -O2 -ftree-vectorize.

The ifort default apparently includes optimization, unlike gcc. You should not expect ifort -S and gcc -S output to be remotely similar if you don't use -O3 for gcc.


when I use -O3 it throws away any reference to both XMM and YMM in the assembly.

It's a good thing when compilers optimize away useless work.

Write a function that takes an array input arg and writes an output arg, and look at asm for that function. Or a function that operates on two global arrays. Not a whole program, because compilers have whole-program optimization.

Anyway, see How to remove "noise" from GCC/clang assembly output? for tips on writing useful functions for looking at compiler asm output. That's a C Q&A but all the advice applies to Fortran as well: write functions that take args and return a result or have a side effect that can't optimize away.

http://godbolt.org/ doesn't have Fortran, and it looks like -xfortran doesn't work to make g++ compile as fortran. (-xc works to compile as C instead of C++ on Godbolt, though.) Otherwise I'd recommend that tool for looking at compiler output.


I made a C version of your loop to see what gcc does for presumably similar input to its optimizer. (I don't have gfortran 8.1 installed, and I barely know Fortran. I'm here for the AVX and optimization tags, but gfortran uses the same backend as gcc which I am very familiar with.)

void store_i5(double *x) {
    for(int i=0 ; i<512; i++) {
        x[i] = 5.0 + i;
    }
}

With i<8 as the loop condition, gcc -O3 -march=haswell and clang sensibly optimize the function to just copy 8 doubles from static constants, with vmovupd. Increasing the array size, gcc fully unrolls a copy for surprisingly large sizes, up to 143 doubles. But for 144 or more, it makes a loop that actually calculates. There's probably a tuning parameter somewhere to control this heuristic. BTW, clang fully unrolls a copy even for 256 doubles, with -O3 -march=haswell. But 512 is large enough that both gcc and clang make loops that calculate.

gcc8.1's inner loop (with -O3 -march=haswell) looks like this, using -masm=intel. (See source+asm on the Godbolt compiler explorer).

    vmovdqa ymm1, YMMWORD PTR .LC0[rip]  # [0,1,2,3,4,5,6,7]
    vmovdqa ymm3, YMMWORD PTR .LC1[rip]  # set1_epi32(8)
    lea     rax, [rdi+4096]              # rax = endp
    vmovapd ymm2, YMMWORD PTR .LC2[rip]  # set1_pd(5.0)

.L2:                                   # do {
    vcvtdq2pd       ymm0, xmm1              # packed convert 4 elements to double
    vaddpd  ymm0, ymm0, ymm2                # +5.0
    add     rdi, 64
    vmovupd YMMWORD PTR [rdi-64], ymm0      # store x[i+0..3]
    vextracti128    xmm0, ymm1, 0x1
    vpaddd  ymm1, ymm1, ymm3                # [i0, i1, i2, ..., i7] += 8 packed 32-bit integer add (d=dword)
    vcvtdq2pd       ymm0, xmm0              # convert the high 4 elements
    vaddpd  ymm0, ymm0, ymm2
    vmovupd YMMWORD PTR [rdi-32], ymm0
    cmp     rax, rdi
    jne     .L2                        # }while(p < endp);

We can defeat constant propagation for a small array by using an offset, so the values to be stored are not a compile-time constant anymore:

void store_i5_var(double *x, int offset) {
    for(int i=0 ; i<8; i++) {
        x[i] = 5.0 + (i + offset);
    }
}

gcc uses basically the same loop body as above, with a bit of setup but the same vector constants.


Tuning options:

gcc -O3 -march=native on some targets will prefer auto-vectorizing with 128-bit vectors, so you still won't get YMM registers. You can use -march=native -mprefer-vector-width=256 to override that (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). (Or with gcc7 and earlier, -mno-prefer-avx128.)
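
For example, assuming the loop lives in a subroutine that can't optimize away (gfortran accepts the same target options as gcc, since it shares the backend), something along these lines should request 256-bit vectorization even where native tuning would otherwise pick 128-bit:

    gfortran -O3 -march=native -mprefer-vector-width=256 -S loopunroll.f90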

gcc prefers 256-bit for -march=haswell because the execution units are fully 256-bit, and it has efficient 256-bit loads/stores.

Bulldozer and Zen split 256-bit instructions into two 128-bit internally, so it can actually be faster to run twice as many XMM instructions, especially if your data isn't always aligned by 32. Or when scalar prologue / epilogue overhead is relevant. Definitely benchmark both ways if you're using an AMD CPU. Or actually for any CPU it's not a bad idea.

Also in this case, gcc doesn't realize that it should use XMM vectors of integers and YMM vectors of doubles. (Clang and ICC are better at mixing different vector widths when appropriate). Instead it extracts the high 128 of a YMM vector of integers every time. So one reason that 128-bit vectorization sometimes wins is that sometimes gcc shoots itself in the foot when doing 256-bit vectorization. (gcc's auto-vectorization is often clumsy with types that aren't all the same width.)

With -march=znver1 -mno-prefer-avx128, gcc8.1 does the stores to memory with two 128-bit halves, because it doesn't know if the destination is 32-byte aligned or not (https://godbolt.org/g/A66Egm). tune=znver1 sets -mavx256-split-unaligned-store. You can override that with -mno-avx256-split-unaligned-store, e.g. if your arrays usually are aligned but you haven't given the compiler enough information.
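
For example, if your arrays usually are 32-byte aligned in practice (or you've benchmarked that full-width unaligned stores are still a win), a command along these lines combines both overrides; it's only illustrative, using the gcc8 options mentioned above:

    gfortran -O3 -march=znver1 -mprefer-vector-width=256 -mno-avx256-split-unaligned-store -S loopunroll.f90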

  • Hi Peter, I have tried -O3 -march=native, -ftree-vectorize, and have also tried the -mtune flag, -fpeel-loops, -mavx2, and unrolling loops, and none work. I also don't want to 'throw the baby out with the bath water' at this stage by using -O2 or -O3. The whole point here is to break this toy code down and really understand how to get the gcc assembler to produce YMM registers in the assembly. – Morph Jul 02 '18 at 15:40
  • @Morph: updated my answer with an example of what I was suggesting. You must use `-O3` to get auto-vectorization, but you also must change your source so it isn't just a small copy of the compile-time-constant result. – Peter Cordes Jul 02 '18 at 19:21
  • BTW, thanks for taking the time to respond to my post and embellish where needed (I am a COMPLETE newbie to assembly). – Morph Jul 02 '18 at 20:44
  • @Morph: Ok, `-march=native` on your machine is `-march=znver1`. gcc's `tune=znver1` might include `-mprefer-avx128`, so you might sometimes see auto-vec with only XMM vectors. (You can see the full set of options enabled/implied by your command line with `-S -fverbose-asm`: you get a big block of comments at the top.) – Peter Cordes Jul 02 '18 at 20:50
  • @Morph: with Fortran, are you still writing a whole *program* rather than a subroutine? If so, the compiler can see that the result isn't used in the whole program, and optimize it away. Remember that you don't need it to link or run, you just need to write a function that compiles to a `.o` or `.s`. Or to make a complete program, pass your array as an arg to a function in another source file. (If you compile without link-time optimization, it can't inline, so the compiler can't omit creating the array data in memory.) – Peter Cordes Jul 02 '18 at 20:54
  • @Morph: what exactly did you do? `gfortran -O3 -march=haswell -S -fverbose-asm` on a version of your program that calls a subroutine from another file on the array after the loop? If you got `pd`, then gcc is auto-vectorizing. If it only used XMM but not YMM, then it's probably a Zen tuning option that's making it choose not to use 256-bit vectors, so try `-O3 -march=haswell` or `-O3 -march=native -mprefer-vector-width=256` (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). – Peter Cordes Jul 02 '18 at 21:16
  • I made a bad job of reporting the latest result. Here it is over a few comments: `gfortran -S -march=znver1 -fverbose-asm loopunroll.f90` gives `# -fchkp-use-wrappers -fcommon -fdelete-null-pointer-checks # -mieee-fp -mlong-double-80 -mlzcnt -mmmx -mmovbe -mmwaitx -mpclmul` `vcvtsi2sd -4(%rbp), %xmm0, %xmm0 # i, _1` `vmovsd .LC0(%rip), %xmm1 #, tmp94` `vaddsd %xmm1, %xmm0, %xmm0 # tmp94, _1, _4` `vmovsd %xmm0, -4112(%rbp,%rax,8) # _4, x` – Morph Jul 02 '18 at 21:19
  • Same result whether I use -march=znver1 or haswell... just XMM in the assembly. I have to leave the -O3 out of this, otherwise no XMM or YMM. – Morph Jul 02 '18 at 21:20
  • Also tried -O3 -march=native -mprefer-vector-width=256, with no YMM outcome. – Morph Jul 02 '18 at 21:20
  • @Morph: What did you do that produced a `vaddpd`? Everything without `-O3` is useless, stop wasting our time with un-optimized code. It will *never* be auto-vectorized at all without optimization. – Peter Cordes Jul 02 '18 at 21:22
  • I made a mistake, there was no vaddpd. – Morph Jul 02 '18 at 21:23
  • `-march=znver1 -O3 -fverbose-asm` produces: `# -fchkp-use-wrappers -fcommon -fdelete-null-pointer-checks # -mieee-fp -mlong-double-80 -mlzcnt -mmmx -mmovbe -mmwaitx -mpclmul` `vcvtsi2sd -4(%rbp), %xmm0, %xmm0 # i, _1` `vmovsd .LC0(%rip), %xmm1 #, tmp94` `vaddsd %xmm1, %xmm0, %xmm0 # tmp94, _1, _4` `vmovsd %xmm0, -4112(%rbp,%rax,8) # _4, x` – Morph Jul 02 '18 at 21:24
  • @Morph: That doesn't look like `-O3` output, it looks like `-O0`. `vcvtsi2sd -4(%rbp), %xmm0, %xmm0` shows that the compiler stored the integer loop counter to the stack, and made a stack frame with RBP. The first one especially (reloading the counter from memory) is a huge sign that this is definitely un-optimized code, so you left out the `-O3`. (Optimized would use `vcvtsi2sd %eax, ...`). Please stop wasting my time posting un-optimized `-O0` asm output. Write a stand-alone fortran subroutine that takes an array arg by reference and writes the result into it, like my C function does. – Peter Cordes Jul 02 '18 at 21:27
  • @Morph: updated the top of my answer with a simpler version of how to make it not optimize away. (When I started this answer, I think I didn't notice initially that it could just optimize away, that's why that key point wasn't more obvious from the start.) – Peter Cordes Jul 02 '18 at 21:38
  • I made a mistake in leaving a -O0 in the script... my very stupid bad. I have cleaned things up also. However, it only works with -march=haswell. If I use -march=znver1 it's just going back to XMM. I don't think we are home and dry yet :( – Morph Jul 02 '18 at 21:42
  • @Morph: I just updated the bottom of my answer about tuning options. 128-bit vectorization is sometimes optimal for Zen. But you can override it with tuning options. Benchmark both ways! – Peter Cordes Jul 02 '18 at 21:45

Just to tidy this up, Peter's advice was correct. My code now looks like:

program loopunroll

double precision x(512)
call looptest(x)

end program loopunroll

subroutine looptest(x)
  integer i
  double precision x(512)
  do i=1,512
     x(i) = dble(i) + 5.0d0
  enddo
  return
end subroutine looptest

and the way to produce the YMM is with

 gfortran -S  -march=haswell -O3 loopunroll.f90