
I'm compiling this code:

#include <cstdint>

template <typename T>
struct vec{ T v[4]; };

template <typename T>
vec<T> foo (vec<T> x, vec<T> y, vec<T> z) {
    return {
        x.v[0] + y.v[0] * z.v[0],
        x.v[1] + y.v[1] * z.v[1],
        x.v[2] + y.v[2] * z.v[2],
        x.v[3] + y.v[3] * z.v[3]
    };
}

template vec<int64_t> foo ( vec<int64_t> x, vec<int64_t> y, vec<int64_t> z);
template vec<float> foo ( vec<float> x, vec<float> y, vec<float> z);

at maximum optimization, with clang 6.0 and gcc 7.3. But the results are weird:

  • No compiler uses fused multiply-adds, for either integers or floats, although these seem like the obvious choice. Why?
  • gcc uses a bazillion instructions for the int64_t case (not for the float case), many more than clang and many more than it uses itself at -O2. Is that really faster?

clang 6.0:

vec<long> foo<long>(vec<long>, vec<long>, vec<long>):             # @vec<long> foo<long>(vec<long>, vec<long>, vec<long>)
        mov     rax, qword ptr [rsp + 72]
        imul    rax, qword ptr [rsp + 40]
        add     rax, qword ptr [rsp + 8]
        mov     qword ptr [rdi], rax
        mov     rax, qword ptr [rsp + 80]
        imul    rax, qword ptr [rsp + 48]
        add     rax, qword ptr [rsp + 16]
        mov     qword ptr [rdi + 8], rax
        mov     rax, qword ptr [rsp + 88]
        imul    rax, qword ptr [rsp + 56]
        add     rax, qword ptr [rsp + 24]
        mov     qword ptr [rdi + 16], rax
        mov     rax, qword ptr [rsp + 96]
        imul    rax, qword ptr [rsp + 64]
        add     rax, qword ptr [rsp + 32]
        mov     qword ptr [rdi + 24], rax
        mov     rax, rdi
        ret
vec<float> foo<float>(vec<float>, vec<float>, vec<float>):             # @vec<float> foo<float>(vec<float>, vec<float>, vec<float>)
        mulps   xmm2, xmm4
        addps   xmm0, xmm2
        mulps   xmm3, xmm5
        addps   xmm1, xmm3
        ret

GCC 7.3:

vec<long> foo<long>(vec<long>, vec<long>, vec<long>):
        movdqu  xmm3, XMMWORD PTR [rsp+56]
        mov     rax, rdi
        movdqu  xmm4, XMMWORD PTR [rsp+88]
        movdqa  xmm1, xmm3
        movdqa  xmm0, xmm3
        movdqa  xmm2, xmm4
        movdqu  xmm5, XMMWORD PTR [rsp+72]
        pmuludq xmm1, xmm4
        psrlq   xmm0, 32
        psrlq   xmm2, 32
        pmuludq xmm0, xmm4
        pmuludq xmm2, xmm3
        movdqu  xmm4, XMMWORD PTR [rsp+40]
        paddq   xmm0, xmm2
        psllq   xmm0, 32
        paddq   xmm0, xmm1
        movdqa  xmm3, xmm5
        movdqu  xmm1, XMMWORD PTR [rsp+24]
        movdqa  xmm2, xmm4
        psrlq   xmm3, 32
        pmuludq xmm3, xmm4
        paddq   xmm1, xmm0
        movdqu  xmm6, XMMWORD PTR [rsp+8]
        pmuludq xmm2, xmm5
        movdqa  xmm0, xmm4
        movups  XMMWORD PTR [rdi+16], xmm1
        psrlq   xmm0, 32
        pmuludq xmm0, xmm5
        paddq   xmm0, xmm3
        psllq   xmm0, 32
        paddq   xmm0, xmm2
        paddq   xmm0, xmm6
        movups  XMMWORD PTR [rdi], xmm0
        ret
vec<float> foo<float>(vec<float>, vec<float>, vec<float>):
        movq    QWORD PTR [rsp-40], xmm2
        movq    QWORD PTR [rsp-32], xmm3
        movq    QWORD PTR [rsp-56], xmm0
        movq    QWORD PTR [rsp-24], xmm4
        movq    QWORD PTR [rsp-16], xmm5
        movq    QWORD PTR [rsp-48], xmm1
        movaps  xmm0, XMMWORD PTR [rsp-40]
        mulps   xmm0, XMMWORD PTR [rsp-24]
        addps   xmm0, XMMWORD PTR [rsp-56]
        movaps  XMMWORD PTR [rsp-56], xmm0
        mov     rax, QWORD PTR [rsp-48]
        movq    xmm0, QWORD PTR [rsp-56]
        mov     QWORD PTR [rsp-56], rax
        movq    xmm1, QWORD PTR [rsp-56]
        ret
einpoklum
  • Looks like gcc auto-vectorizes the 64-bit integer multiply using packed 32x32 => 64-bit multiplies, with 3 `pmuludq` for each pair of 64-bit multiplies (see the scalar sketch after these comments). This doesn't look like a win, although do note that Skylake has 2 per clock throughput for `pmuludq`, but only 1 per clock for 64-bit integer multiply. It's probably a win with AVX2, doing the whole thing with one vector of 4 int64_t, and avoiding most of the `movdqa` with 3-operand VEX instructions. – Peter Cordes Mar 13 '18 at 10:39
  • TL:DR: looks like overly aggressive auto-vectorization. Did you microbenchmark it for throughput and/or latency? It's probably worse on both counts, but probably not more than a factor of 2, maybe even less bad than that. – Peter Cordes Mar 13 '18 at 10:40
  • *No compiler uses fused multiply-adds*: You wrote integer code. The only integer mul+add instructions are horizontal adds like [PMADDWD](https://github.com/HJLebbink/asm-dude/wiki/PMADDWD), until AVX512IFMA, and [VPMADD52LUQ](https://github.com/HJLebbink/asm-dude/wiki/VPMADD52LUQ) only operates on the low 52 bits of an integer element. (It's not a coincidence that this is the mantissa width of `double`: the point of AVX512-IFMA is obviously to expose the FMA unit for integer use without actually building it wider.) – Peter Cordes Mar 13 '18 at 10:43
  • @PeterCordes: See edit; it doesn't happen for floats either. Also, no, I wanted to understand what the rationale is. Plus - I only have one single platform I could potentially micro-benchmark on. – einpoklum Mar 13 '18 at 11:12
  • FMA isn't baseline for x86-64! If gcc used FMA instructions by default, it would make code that faulted with SIGILL on some machines. You have to compile with `-mfma`, or `-march=haswell` or `-march=bdver2` or whatever to enable FMA + more, and set `-mtune=haswell`. (`gcc -O3 -march=native` is good for local use.) – Peter Cordes Mar 13 '18 at 11:31
  • Clang does not do floating point contraction by default but GCC does. Add `-ffp-contract=fast` to clang and enable FMA hardware: https://godbolt.org/g/kuQv6Z – Z boson Mar 13 '18 at 13:28
  • @PeterCordes: So the appropriate `-march` would enable FMA? – einpoklum Mar 13 '18 at 14:23
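
As a side note to the `pmuludq` comment above, here is a scalar sketch (illustrative only; the function name is made up) of the decomposition GCC performs in each 64-bit lane:

#include <cstdint>

// Truncated 64x64 => 64-bit multiply built from 32x32 => 64-bit pieces,
// which is what pmuludq gives per lane. The a_hi*b_hi part would land
// entirely above bit 63, so it is dropped.
uint64_t mul64_from_32(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xffffffff, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffff, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;                  // pmuludq #1
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;    // pmuludq #2 and #3, paddq
    return lo_lo + (cross << 32);                  // psllq 32, paddq
}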

1 Answer


First you need to enable FMA hardware, e.g. with -mfma. Then, with Clang, you need to tell it to contract, either with -ffp-contract=fast (GCC and ICC contract by default) or by adding #pragma STDC FP_CONTRACT ON (see https://stackoverflow.com/a/34461738/2542702). With Clang this produces

    vfmadd213ps     xmm2, xmm4, xmm0
    vfmadd213ps     xmm3, xmm5, xmm1
    vmovaps xmm0, xmm2
    vmovaps xmm1, xmm3
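
For illustration, a minimal sketch of the pragma route (the function name is made up; FMA hardware still has to be enabled separately, e.g. clang++ -O3 -mfma):

// Sketch: the C standard contraction pragma instead of -ffp-contract=fast.
#pragma STDC FP_CONTRACT ON

float madd(float x, float y, float z) {
    return x + y * z;   // with -mfma this should contract to a vfmadd* instruction
}

The same applies to the question's vec<float> template when the pragma is in effect for that file.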

For best results with GCC, use vector extensions:

typedef float float4 __attribute__((vector_size(sizeof(float)*4)));

float4 foof(float4 z, float4 y, float4 x) {
    return x + y*z;
}

With GCC and Clang this produces simply

vfmadd132ps     xmm0, xmm2, xmm1

https://godbolt.org/g/CrffNR

In my experience, Clang does a better job than GCC with arrays of and loops over vector extensions, but vector extensions are best supported in GCC.
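
The same vector-extension style can be sketched for the question's int64_t case (untested; the typedef name is made up). There is no packed 64-bit multiply before AVX-512DQ, so the multiply still gets decomposed, but with AVX2 enabled (e.g. -march=haswell) it can at least stay in one 256-bit vector, as noted in the comments above:

#include <cstdint>

// One GCC/Clang vector-extension type holding four 64-bit lanes (32 bytes).
typedef int64_t long4 __attribute__((vector_size(sizeof(int64_t)*4)));

long4 fooi(long4 z, long4 y, long4 x) {
    return x + y * z;   // element-wise multiply-add; no integer FMA instruction exists for this
}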

Z boson
  • Wouldn't `-ffp-contract=on` be enough? Also, why should the `vector_size` attribute be useful? Finally, with C++17, can't we do `[[vector_size(sizeof(float)*4)]]`? – einpoklum Mar 13 '18 at 14:23
  • @einpoklum, I don't know about `-ffp-contract=on`. Try it and find out. But I expect you need `fast`. Vector extensions are useful because GCC currently generates much better code with them. I don't know about C++17 syntactic sugar. – Z boson Mar 13 '18 at 14:30