
I have the following AVX and naive code:

__forceinline double dotProduct_2(const double* u, const double* v)
{
    _mm256_zeroupper();
    __m256d xy          = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));  // element-wise products
    __m256d temp        = _mm256_hadd_pd(xy, xy);                               // horizontal add within each 128-bit lane
    __m128d dotproduct  = _mm_add_pd(_mm256_extractf128_pd(temp, 0), _mm256_extractf128_pd(temp, 1)); // add the two lanes
    return dotproduct.m128d_f64[0];                                             // MSVC-specific element access
}

__forceinline double dotProduct_1(const D3& a, const D3& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

And the respective benchmark code:

std::cout << res_1 << " " << res_2 << " " << res_3 << '\n';
{
    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < (1 << 30); ++i)
    {
        zx_1 += dotProduct_1(aVx[i % 10000], aVx[(i + 1) % 10000]);
    }
    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

    std::cout << "NAIVE : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}

{
    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < (1 << 30); ++i)
    {
        zx_2 += dotProduct_2(&aVx[i % 10000][0], &aVx[(i + 1) % 10000][0]);
    }

    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

    std::cout << "AVX : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}

std::cout << math::min2(zx_1, zx_2) << " " << zx_1 << " " << zx_2;

Well, all of the data is aligned to 32 bytes (D3 with __declspec... and the aVx array with _mm_malloc()...). As far as I can see, the naive variant is as fast as or faster than the AVX variant. Is this normal behaviour? I thought AVX was supposed to be 'super fast'... If not, how can I optimize it? I compile with MSVC 2015 (x64) with /arch:AVX, and my hardware is an Intel i7-4750HQ (Haswell).

Des Spigel
    Post the ASM that MSVC makes from them. It probably auto-vectorizes with something more efficient than `vhaddpd`. Ideally it uses FMADDPD into a vector accumulator and saves all the horizontal summing for the end. If MSVC doesn't optimize away your manually-vectorized shuffles, then it's not going to be fast. See https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86 for more about why `hadd` is the wrong shuffle. – Peter Cordes Sep 30 '17 at 12:47
  • Not to mention that you can't inquire into performance until you have dependable benchmarks. –  Sep 30 '17 at 13:28
  • Here is the ASM code (is this part enough?): https://sun9-11.userapi.com/c840724/v840724207/bfbd/kN62Ipo6TZ8.jpg – Des Spigel Sep 30 '17 at 13:59
  • If AVX is not so fast, when should it be used? Or should I change the compiler? – Des Spigel Sep 30 '17 at 14:00
  • AVX is not the problem, you're just using it wrong. As usual, save the horizontal part for the end. – harold Sep 30 '17 at 14:36
  • @harold What do you mean? – Des Spigel Sep 30 '17 at 14:43
  • @DesSpigel: it means: do the partial dot products vertically in the loop, and then just do one final horizontal sum at the end. – Paul R Sep 30 '17 at 16:06

2 Answers

3

Simple profiling with basic loops isn't a great idea - it usually just means you are memory bandwidth limited, so the tests end up coming out at about the same speed (memory is typically slower than the CPU, and that's basically all you are testing here).

As others have said, your code example isn't great, because you are constantly going across the lanes (which I assume is just to find the fastest dot product, and not specifically because a sum of all the dot products is the desired result?). To be honest, if you really need a fast dot product for AOS data as presented here, I would prefer to replace the VHADDPD with a VADDPD + VPERMILPD (trading an additional instruction for twice the throughput and lower latency):

double dotProduct_3(const double* u, const double* v)
{
    __m256d dp = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
    __m128d a = _mm256_extractf128_pd(dp, 0);   // lower two products
    __m128d b = _mm256_extractf128_pd(dp, 1);   // upper two products
    __m128d c = _mm_add_pd(a, b);               // [p0+p2, p1+p3]
    __m128d yy = _mm_unpackhi_pd(c, c);         // broadcast the high element
    __m128d dotproduct  = _mm_add_pd(c, yy);    // (p0+p2) + (p1+p3)
    return _mm_cvtsd_f64(dotproduct);
}

asm:

dotProduct_3(double const*, double const*):
 vmovapd ymm0,YMMWORD PTR [rsi]
 vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
 vextractf128 xmm1,ymm0,0x1
 vaddpd xmm0,xmm1,xmm0
 vpermilpd xmm1,xmm0,0x3
 vaddpd xmm0,xmm1,xmm0
 vzeroupper 
 ret   

Generally speaking, if you are using horizontal adds, you're doing it wrong! Whilst a 256bit register may seem ideal for a Vector4d, it's not actually a particularly great representation (especially if you consider that AVX512 is now available!). A very similar question to this came up recently: For C++ Vector3 utility class implementations, is array faster than struct and class?

If you want performance, then structure-of-arrays is the best way to go.

struct HybridVec4SOA
{
  __m256d x;   // x components of 4 vectors
  __m256d y;   // y components of 4 vectors
  __m256d z;   // z components of 4 vectors
  __m256d w;   // w components of 4 vectors
};
__m256d dot(const HybridVec4SOA& a, const HybridVec4SOA& b)
{
  // computes 4 dot products at once, one per lane
  return _mm256_fmadd_pd(a.w, b.w,
         _mm256_fmadd_pd(a.z, b.z,
         _mm256_fmadd_pd(a.y, b.y,
         _mm256_mul_pd(a.x, b.x))));
}

asm:

dot(HybridVec4SOA const&, HybridVec4SOA const&):
 vmovapd ymm1,YMMWORD PTR [rdi+0x20]
 vmovapd ymm2,YMMWORD PTR [rdi+0x40]
 vmovapd ymm3,YMMWORD PTR [rdi+0x60]
 vmovapd ymm0,YMMWORD PTR [rsi]
 vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
 vfmadd231pd ymm0,ymm1,YMMWORD PTR [rsi+0x20]
 vfmadd231pd ymm0,ymm2,YMMWORD PTR [rsi+0x40]
 vfmadd231pd ymm0,ymm3,YMMWORD PTR [rsi+0x60]
 ret    

If you compare the latencies (and more importantly throughput) of load/mul/fmadd compared to hadd and extract, and then consider that the SOA version is computing 4 dot products at a time (instead of 1), you'll start to understand why it's the way to go...
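
To make that concrete, here is a small usage sketch (my illustration, not part of the original answer) of how the 4-wide dot could feed an accumulation loop: everything stays vertical inside the loop, and there is a single horizontal reduction at the very end. The names hsum256_pd and sumDots are made up for illustration, and the sketch assumes the data is already laid out in HybridVec4SOA blocks.

#include <immintrin.h>
#include <cstddef>

// hypothetical helper: sums the 4 lanes of a __m256d
static inline double hsum256_pd(__m256d v)
{
    __m128d lo  = _mm256_castpd256_pd128(v);    // lanes 0,1
    __m128d hi  = _mm256_extractf128_pd(v, 1);  // lanes 2,3
    __m128d sum = _mm_add_pd(lo, hi);
    return _mm_cvtsd_f64(_mm_add_sd(sum, _mm_unpackhi_pd(sum, sum)));
}

// sum of 4*blocks dot products, with only one horizontal reduction at the end
double sumDots(const HybridVec4SOA* a, const HybridVec4SOA* b, size_t blocks)
{
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i < blocks; ++i)
        acc = _mm256_add_pd(acc, dot(a[i], b[i]));   // 4 dot products per iteration, all vertical
    return hsum256_pd(acc);
}

Note that the reduction, the only shuffle-heavy part, now runs once per call instead of once per dot product.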

robthebloke
  • Avoiding `vhaddpd` is a win for throughput *and* latency. There's no tradeoff, `hadd` instructions just suck unless you can use them with 2 different inputs (e.g. as part of a transpose-and-add). Intel and AMD decode them to 2 shuffle uops that deinterleave odd/even to produce 2 inputs for a vertical add uop. (Presumably internally using the same shuffle uops as a shufps / shufpd). – Peter Cordes Oct 03 '17 at 02:08
  • Thanks peter! I think I left my head unscrewed when I wrote that! Updated the comment. – robthebloke Oct 03 '17 at 04:12
2

You add too much overhead with the vzeroupper and hadd instructions. A good way to write it is to do all the multiplies in a loop and aggregate the result just once at the end. Imagine you unroll the original loop 4 times and use 4 accumulators:

double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (int i = 0; i < (1 << 30); i += 4) {
  s0 += a[i+0] * b[i+0];
  s1 += a[i+1] * b[i+1];
  s2 += a[i+2] * b[i+2];
  s3 += a[i+3] * b[i+3];
}
return s0 + s1 + s2 + s3;

Now just replace the unrolled loop with SIMD multiplies and adds (or even FMA intrinsics, if available).
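
Here is a minimal sketch of that replacement (my illustration, not Elalfer's code), assuming FMA3 is available (it is on the asker's Haswell), the pointers are 32-byte aligned, and the length is a multiple of 8; the function name dotProduct_simd is hypothetical:

#include <immintrin.h>
#include <cstddef>

double dotProduct_simd(const double* a, const double* b, size_t n)
{
    // two vector accumulators (8 doubles per iteration), mirroring the 4 scalar
    // accumulators above, to hide the FMA latency
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 8)
    {
        acc0 = _mm256_fmadd_pd(_mm256_load_pd(a + i),     _mm256_load_pd(b + i),     acc0);
        acc1 = _mm256_fmadd_pd(_mm256_load_pd(a + i + 4), _mm256_load_pd(b + i + 4), acc1);
    }
    // aggregate just once at the end
    __m256d acc = _mm256_add_pd(acc0, acc1);
    __m128d sum = _mm_add_pd(_mm256_castpd256_pd128(acc), _mm256_extractf128_pd(acc, 1));
    sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
    return _mm_cvtsd_f64(sum);
}

If FMA can't be assumed, each fmadd can be split into _mm256_mul_pd followed by _mm256_add_pd.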

Elalfer
  • The usual word is "accumulator", not "aggregator". But yes, exactly. Thanks for summarizing the comments into an answer. – Peter Cordes Oct 02 '17 at 18:38