
I have std::vector<double> X,Y both of size N (with N%16==0) and I want to calculate sum(X[i]*Y[i]). That's a classic use case for Fused Multiply-Add (FMA), which should be fast on AVX-capable processors. I know all my target CPUs are Intel, Haswell or newer.

How do I get GCC to emit that AVX code? -mfma is part of the solution, but do I need other switches?

And is std::vector<double>::operator[] hindering this? I know I can transform

size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i != N; ++i) sum += X[i] * Y[i];

to

size_t N = X.size();
double sum = 0.0;
double const* Xp = &X[0];
double const* Yp = &Y[0];
for (size_t i = 0; i != N; ++i) sum += Xp[i] * Yp[i];

so the compiler can spot that &X[0] doesn't change in the loop. But is this sufficient or even necessary?

Current compiler is GCC 4.9.2, Debian 8, but could upgrade to GCC 5 if necessary.

MSalters
  • You need associative math to vectorize a reduction. This has nothing to do with fma. – Z boson Feb 17 '16 at 12:42
  • I think you mean AVX2 capable processors. Sandy Bridge and Ivy Bridge have AVX but do not have fma. Bulldozer has fma and avx but `-mfma` enables fma3 not fma4. – Z boson Feb 17 '16 at 13:21
  • @Zboson: How is `sum = (sum + X*Y)` not an FMA operation? In particular, it's an FMA3 operation. I don't see why you bring up vectorization reduction. – MSalters Feb 17 '16 at 13:32
  • I mean it appears you are interested in fma because it's potentially faster, not because of the single rounding. If you're looking for speed, you want to vectorize the loop irrespective of fma. Also, it was not clear to me whether you had found a problem, because if you had looked at the assembly you would have seen that fma was already being used, so I thought maybe the problem was that it was not vectorized. – Z boson Feb 17 '16 at 13:38
  • @Zboson: this single loop was absolutely dominating one of our applications, about 60% of the time was spent in there. Less rounding is also a benefit to us, but less so. – MSalters Feb 17 '16 at 13:47
  • Well did you try `-Ofast`? That's what you need to vectorize the reduction. As I said GCC does not unroll the loop anyway so it could be better. But if N is large then the operation is memory bandwidth bound so there is not much you can do except to do something else. – Z boson Feb 17 '16 at 13:54

2 Answers


Did you look at the assembly? I put

#include <cstddef>
#include <vector>

double foo(std::vector<double> &X, std::vector<double> &Y) {
    size_t N = X.size();
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) sum += X[i] * Y[i];
    return sum;
}

into http://gcc.godbolt.org/ and looked at the assembly in GCC 4.9.2 with -O3 -mfma and I see

.L3:
        vmovsd  (%rcx,%rax,8), %xmm1
        vfmadd231sd     (%rsi,%rax,8), %xmm1, %xmm0
        addq    $1, %rax
        cmpq    %rdx, %rax
        jne     .L3

So it uses fma. However, it does not vectorize the loop (the s in sd means single, i.e. not packed, and the d means double-precision floating point).

To vectorize the loop you need to enable associative math, e.g. with -Ofast. Using -Ofast -mavx2 -mfma gives

.L8:
        vmovupd (%rax,%rsi), %xmm2
        addq    $1, %r10
        vinsertf128     $0x1, 16(%rax,%rsi), %ymm2, %ymm2
        vfmadd231pd     (%r12,%rsi), %ymm2, %ymm1
        addq    $32, %rsi
        cmpq    %r10, %rdi
        ja      .L8

So now it's vectorized (pd means packed doubles). However, it's not unrolled. This is currently a limitation of GCC. You need to unroll several times to hide the latency of the dependency chain on the accumulator. If you want the compiler to do this for you, consider using Clang, which unrolls four times; otherwise unroll by hand with intrinsics (a sketch follows below).
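
For example, a hand-unrolled version with AVX2/FMA intrinsics could look roughly like the sketch below (illustrative only; it assumes N is a multiple of 16 and 32-byte-aligned data, and the name dot_unrolled is my own; four accumulators hide the fma latency):

#include <immintrin.h>
#include <cstddef>

double dot_unrolled(const double *X, const double *Y, std::size_t N) {
    // Four independent accumulators break the serial dependency on a single register.
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd();
    __m256d acc3 = _mm256_setzero_pd();
    for (std::size_t i = 0; i < N; i += 16) {
        acc0 = _mm256_fmadd_pd(_mm256_load_pd(X + i),      _mm256_load_pd(Y + i),      acc0);
        acc1 = _mm256_fmadd_pd(_mm256_load_pd(X + i + 4),  _mm256_load_pd(Y + i + 4),  acc1);
        acc2 = _mm256_fmadd_pd(_mm256_load_pd(X + i + 8),  _mm256_load_pd(Y + i + 8),  acc2);
        acc3 = _mm256_fmadd_pd(_mm256_load_pd(X + i + 12), _mm256_load_pd(Y + i + 12), acc3);
    }
    // Combine the accumulators, then do a horizontal sum of the four doubles.
    __m256d acc = _mm256_add_pd(_mm256_add_pd(acc0, acc1), _mm256_add_pd(acc2, acc3));
    __m128d lo  = _mm256_castpd256_pd128(acc);
    __m128d hi  = _mm256_extractf128_pd(acc, 1);
    lo = _mm_add_pd(lo, hi);
    return _mm_cvtsd_f64(_mm_add_sd(lo, _mm_unpackhi_pd(lo, lo)));
}

Compile with e.g. -O2 -mavx2 -mfma (or -march=haswell).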

Note that unlike GCC, Clang does not use fma by default with -mfma. To use fma with Clang, use -ffp-contract=fast (e.g. -O3 -mfma -ffp-contract=fast), use #pragma STDC FP_CONTRACT ON, or enable associative math with e.g. -Ofast. You're going to want to enable associative math anyway if you want to vectorize the loop with Clang.
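
A minimal sketch of the pragma variant (illustrative; the dot function is hypothetical), compiled with Clang, e.g. clang++ -O3 -mfma:

#pragma STDC FP_CONTRACT ON   // ask Clang to allow fma contraction in this file

double dot(const double *X, const double *Y, int N) {
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        sum += X[i] * Y[i];   // the multiply-add is now eligible for contraction
    return sum;
}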

See Fused multiply add and default rounding modes and https://stackoverflow.com/a/34461738/2542702 for more info about enabling fma with different compilers.


GCC creates a lot of extra code to handle misalignment and the case where N is not a multiple of 8. You can tell the compiler to assume the arrays are aligned using __builtin_assume_aligned, and that N is a multiple of 8 using N & -8.

The following code, compiled with -Ofast -mavx2 -mfma,

double foo2(double * __restrict X, double * __restrict Y, int N) {
    X = (double*)__builtin_assume_aligned(X,32);
    Y = (double*)__builtin_assume_aligned(Y,32);
    double sum = 0.0;
    for (int i = 0; i < (N & -8); ++i) sum += X[i] * Y[i];
    return sum;
}

produces the following simple assembly

        andl    $-8, %edx
        jle     .L4
        subl    $4, %edx
        vxorpd  %xmm0, %xmm0, %xmm0
        shrl    $2, %edx
        xorl    %ecx, %ecx
        leal    1(%rdx), %eax
        xorl    %edx, %edx
.L3:
        vmovapd (%rsi,%rdx), %ymm2
        addl    $1, %ecx
        vfmadd231pd     (%rdi,%rdx), %ymm2, %ymm0
        addq    $32, %rdx
        cmpl    %eax, %ecx
        jb      .L3
        vhaddpd %ymm0, %ymm0, %ymm0
        vperm2f128      $1, %ymm0, %ymm0, %ymm1
        vaddpd  %ymm1, %ymm0, %ymm0
        vzeroupper
        ret
.L4:
        vxorpd  %xmm0, %xmm0, %xmm0
        ret
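
For reference, a hypothetical call site from the question's vectors could be the following (my own sketch; note that the default std::vector allocator does not guarantee 32-byte alignment, so the __builtin_assume_aligned promise only holds if you arrange for aligned storage yourself):

double s = foo2(&X[0], &Y[0], static_cast<int>(X.size()));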
Z boson
  • I actually see 3 `vfmadd231sd` instructions before and 3 after the single `vfmadd231pd`, presumably that's an alignment fix? – MSalters Feb 17 '16 at 13:58
  • @MSalters, yes, it's likely a memory-alignment and multiple-of-8 fix. I added info to my answer about this. You can fix the multiple of 8 using `N & -8`. You said the arrays are a multiple of 16, but 8 is sufficient. – Z boson Feb 17 '16 at 14:10

I'm not sure this will get you all the way there, but I'm almost sure it's a big part of the solution.

You have to break the loop into two: an outer loop from 0 to N with step M > 1, and an inner loop from 0 to M. I'd try M of 16, 8, and 4, and look at the asm. Don't worry about the extra iterator math; GCC is smart enough with it.

GCC should unroll the inner loop, and then it can SIMD it and maybe use FMA.
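
For concreteness, a sketch of that transformation (my own illustration, assuming N is a multiple of M):

#include <cstddef>

double dot_blocked(const double *X, const double *Y, std::size_t N) {
    const std::size_t M = 8;                 // try 16, 8, 4 and inspect the asm
    double sum = 0.0;
    for (std::size_t i = 0; i < N; i += M)   // outer loop, step M
        for (std::size_t j = 0; j < M; ++j)  // inner loop with a fixed trip count
            sum += X[i + j] * Y[i + j];
    return sum;
}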

BitWhistler
  • Do you perhaps have a link ? It sounds like a transform that GCC should be able to do itself. – MSalters Feb 16 '16 at 21:12
  • No, no link. Just my own experience. 4.8 does not transform this. Should be easy enough to check the asm but I don't have 4.9 handy. – BitWhistler Feb 16 '16 at 21:17