
I will preface this by saying that I am a complete beginner at SIMD intrinsics.

Essentially, I have a CPU which supports the AVX2 instruction set (an Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product of two std::vector<float> of size 512.

I have done some digging online and found this and this, and this Stack Overflow question suggests using the function __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask). However, these all suggest different ways of performing the dot product, and I am not sure which is the correct (and fastest) way to do it.

In particular, I am looking for the fastest way to perform the dot product for vectors of size 512 (because I know the vector size affects the implementation).

Thank you for your help.

Edit 1: I am also a little confused about the -mavx2 gcc flag. If I use these AVX2 intrinsics, do I need to add the flag when I compile? Also, is gcc able to do these optimizations for me (say, if I use the -Ofast gcc flag) if I write a naive dot product implementation?

Edit 2: If anyone has the time and energy, I would very much appreciate it if you could write a full implementation. I am sure other beginners would also value this information.

cyrusbehr
  • You only want `dpps` for something like a 3 or 4-element dot product. For larger arrays, you want vertical FMA into multiple accumulators ([Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](//stackoverflow.com/q/45113527)). You'll want `-mfma -mavx2` or better `-march=native`. And yes, you need to enable target options for any intrinsic you want to use, along with `-O3`. – Peter Cordes Dec 27 '19 at 00:39
  • e.g. [How to properly use prefetch instructions?](//stackoverflow.com/q/48994494) shows the inner loop of a normal SIMD dot product. – Peter Cordes Dec 27 '19 at 00:45
  • `-O3` can't auto-vectorize a naive dot-product unless you use OpenMP or `-ffast-math` to tell the compiler to treat FP math as associative. – Peter Cordes Dec 27 '19 at 00:47
  • Sorry, I meant to say `-Ofast` instead of `-O3`, will update my question – cyrusbehr Dec 27 '19 at 00:49
  • `-Ofast` is currently equivalent to `-O3 -ffast-math` so yes that would work. But unfortunately GCC won't use multiple accumulators even if it does unroll an FP loop, so you'll bottleneck on SIMD FMA latency instead of throughput. (Factor of 8 on Skylake) – Peter Cordes Dec 27 '19 at 00:51
  • @PeterCordes if you have the time, I would very much appreciate if you could write a full implementation. Even with the information and links you have provided (which I am very grateful for), I am still confused on how to put it all together (as I am a complete beginner at this stuff). – cyrusbehr Dec 27 '19 at 00:58
  • Soonts' answer is exactly what I was describing. It requires AVX + FMA, not AVX2. Thanks, @Soonts, for writing it up. I think there's another Q&A somewhere with a good canonical dot-product implementation which I was looking for earlier but didn't find immediately. – Peter Cordes Dec 27 '19 at 02:44

1 Answer


_mm256_dp_ps is only useful for dot-products of 2 to 4 elements; for longer vectors use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower.
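
For the tiny case it is suited to, dpps can do a whole 4-element dot product in one instruction. A minimal sketch (illustrative only, using the 128-bit _mm_dp_ps; the mask 0xF1 multiplies all four lanes and puts the sum in lane 0):

#include <immintrin.h>

// Dot product of exactly 4 floats with one dpps instruction.
// High mask nibble 0xF: multiply and sum all 4 lanes; low nibble 0x1: store the sum in lane 0.
inline float dot4( const float* p1, const float* p2 )
{
    const __m128 a = _mm_loadu_ps( p1 );
    const __m128 b = _mm_loadu_ps( p2 );
    return _mm_cvtss_f32( _mm_dp_ps( a, b, 0xF1 ) );
}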


GCC and clang require you to enable (with command line options) ISA extensions that you use intrinsics for, unlike MSVC and ICC.
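
A quick way to catch a missing option at compile time is to test the macros GCC and clang predefine only when the corresponding extensions are enabled (a small optional sketch):

// Optional sanity check for GCC / clang: __AVX__ and __FMA__ are only
// predefined when those ISA extensions are enabled on the command line.
#if defined(__GNUC__) && ( !defined(__AVX__) || !defined(__FMA__) )
#error "Compile with -mavx -mfma (or -march=native)"
#endif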


The code below is probably close to the theoretical performance limit of your CPU. Untested.

Compile it with clang or gcc -O3 -march=native. (Requires at least -mavx -mfma, but the -mtune options implied by -march are good too, and so are -mpopcnt and the other things -march=native enables. Tune options are critical for this to compile efficiently for most CPUs with FMA, specifically -mno-avx256-split-unaligned-load: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)

Or compile it with MSVC -O2 -arch:AVX2

#include <immintrin.h>
#include <vector>
#include <assert.h>

// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with an integer offset argument to make it easier for the compiler to emit good code.

// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_mul_ps( a, b );
}

// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_fmadd_ps( a, b, acc );
}

// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
    assert( a.size() == b.size() );
    assert( 0 == ( a.size() % 32 ) );
    if( a.empty() )
        return 0.0f;

    const float* p1 = a.data();
    const float* const p1End = p1 + a.size();
    const float* p2 = b.data();

    // Process initial 32 values. Nothing to add yet, just multiplying.
    __m256 dot0 = mul8<0>( p1, p2 );
    __m256 dot1 = mul8<1>( p1, p2 );
    __m256 dot2 = mul8<2>( p1, p2 );
    __m256 dot3 = mul8<3>( p1, p2 );
    p1 += 8 * 4;
    p2 += 8 * 4;

    // Process the rest of the data.
    // The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
    // Unrolling manually for 2 reasons:
    // 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
    // 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
    while( p1 < p1End )
    {
        dot0 = fma8<0>( dot0, p1, p2 );
        dot1 = fma8<1>( dot1, p1, p2 );
        dot2 = fma8<2>( dot2, p1, p2 );
        dot3 = fma8<3>( dot3, p1, p2 );
        p1 += 8 * 4;
        p2 += 8 * 4;
    }

    // Add 32 values into 8
    const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
    const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
    const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
    // Add 8 values into 4
    const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
    // Add 4 values into 2
    const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
    // Add 2 lower values into the final result
    const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
    // Return the lowest lane of the result vector.
    // The intrinsic below compiles into a no-op; modern compilers return floats in the lowest lane of the xmm0 register.
    return _mm_cvtss_f32( r1 );
}
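
For completeness, a minimal usage sketch for the 512-element case from the question, assuming it is appended to the same file as the function above (the std::iota fill is just placeholder data):

#include <cstdio>
#include <numeric>

int main()
{
    std::vector<float> a( 512 ), b( 512 );
    std::iota( a.begin(), a.end(), 0.0f );  // placeholder data: 0, 1, 2, ...
    std::iota( b.begin(), b.end(), 1.0f );  // placeholder data: 1, 2, 3, ...
    std::printf( "%f\n", dotProductFma( a, b ) );
    return 0;
}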

Possible further improvements:

  1. Unroll by 8 vectors instead of 4. I’ve checked the gcc 9.2 asm output; the compiler only used 8 of the 16 available vector registers.

  2. Make sure both input vectors are aligned, e.g. use a custom allocator which calls _aligned_malloc / _aligned_free on MSVC, or aligned_alloc / free on gcc & clang. Then replace _mm256_loadu_ps with _mm256_load_ps. A sketch of such an allocator is shown below.
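
For example, a minimal 32-byte-aligned allocator could look like this (the AlignedAllocator name is only illustrative, and std::aligned_alloc needs C++17):

#include <cstdlib>
#include <new>
#ifdef _MSC_VER
#include <malloc.h>
#endif

// Minimal 32-byte-aligned allocator, enough for aligned __m256 loads.
template<typename T>
struct AlignedAllocator
{
    using value_type = T;
    static constexpr std::size_t alignment = 32;

    AlignedAllocator() = default;
    template<typename U>
    AlignedAllocator( const AlignedAllocator<U>& ) noexcept { }

    T* allocate( std::size_t n )
    {
        // aligned_alloc wants the size to be a multiple of the alignment.
        const std::size_t bytes = ( n * sizeof( T ) + alignment - 1 ) / alignment * alignment;
#ifdef _MSC_VER
        void* const p = _aligned_malloc( bytes, alignment );
#else
        void* const p = std::aligned_alloc( alignment, bytes );
#endif
        if( !p )
            throw std::bad_alloc();
        return static_cast<T*>( p );
    }

    void deallocate( T* p, std::size_t ) noexcept
    {
#ifdef _MSC_VER
        _aligned_free( p );
#else
        std::free( p );
#endif
    }
};

template<typename T, typename U>
bool operator==( const AlignedAllocator<T>&, const AlignedAllocator<U>& ) { return true; }
template<typename T, typename U>
bool operator!=( const AlignedAllocator<T>&, const AlignedAllocator<U>& ) { return false; }

// Usage: std::vector<float, AlignedAllocator<float>> a( 512 ), b( 512 );
// Note that dotProductFma above takes std::vector<float> with the default allocator,
// so its signature would need the same allocator type for this to plug in directly.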


To auto-vectorize a simple scalar dot product, you'd also need OpenMP SIMD or -ffast-math (implied by -Ofast) to let the compiler treat FP math as associative even though it's not (because of rounding). But GCC won't use multiple accumulators when auto-vectorizing, even if it does unroll, so you'd bottleneck on FMA latency, not load throughput.

(2 loads per FMA means the throughput bottleneck for this code is vector loads, not actual FMA operations.)
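
For reference, the naive scalar version that the auto-vectorization paragraph refers to could look like this; with OpenMP SIMD enabled (e.g. -fopenmp-simd on gcc/clang), the pragma gives the compiler permission to reorder the additions without -ffast-math (a sketch, not a tuned implementation):

#include <cstddef>
#include <vector>

// Naive scalar dot product. The pragma lets the compiler treat the sum as
// associative, so it can vectorize the reduction without -ffast-math.
float dotProductScalar( const std::vector<float>& a, const std::vector<float>& b )
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for( std::size_t i = 0; i < a.size(); ++i )
        sum += a[i] * b[i];
    return sum;
}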

Soonts
  • Note that the unroll factor doesn't *have* to be a power of 2; e.g. unroll by 6 would be fine (e.g. if you were targeting 32-bit mode with only 8 vector regs, you might aim for 1 `vmovups` load and one `vfmadd` with a memory source operand, and have the other 7 vector regs as accumulators.) But with a known power-of-2 size (512), 4 or 8 is fine. Apparently more accumulators can help in practice even though in theory this would bottleneck on load throughput (2 loads per FMA). But for short vectors, probably best to keep code-size and cleanup compact. Not many total iterations. – Peter Cordes Dec 27 '19 at 02:48
  • The template stuff is not necessary for GCC / clang to use a memory source operand for FMA. It could do constant-propagation after inlining if you just did `p1+8*1` and so on. – Peter Cordes Dec 27 '19 at 02:56
  • I made some edits to your answer; I hope you don't mind the collaboration. If you do, roll it back and let me know so I can repost as my own answer. – Peter Cordes Dec 27 '19 at 03:03
  • @PeterCordes Yeah, I think the performance should be OK. However, numerical precision is probably not. For some input data, these intermediate 32-bit floats combined with an almost random addition order may result in numerical issues. If I were doing what the OP is doing, I would consider accumulating in `__m256d` instead: more complex, slower, but more precise. The best tradeoff depends on the application, obviously. – Soonts Dec 27 '19 at 03:09
  • About the template stuff: I don’t have high confidence in compilers. Constant propagation almost always works, but breaks in some rare cases. When the constant is passed as a template argument, the compiler has much less freedom in how to treat my code. – Soonts Dec 27 '19 at 03:10
  • SIMD + multiple accumulators is part way towards pairwise summation; a recognized technique for mitigating the error of summing an FP array (where adding a small element to a large total is bad). Having many smaller totals reduces rounding error significantly if the elements are mostly of similar magnitude, especially if they're all positive. See my answer on [Simd matmul program gives different numerical results](//stackoverflow.com/a/55478102). – Peter Cordes Dec 27 '19 at 03:48
  • IIRC, FMA also makes efficient Kahan summation possible (but then of course you'd have to compile without `-ffast-math` or the compiler will optimize away the error compensation). Converting on the fly to `double` would be significantly slower, especially on Intel CPUs: `vcvtps2pd` on SKL costs 1 shuffle (port 5) + 1 FP uop (port 0 or 1), so it would bottleneck on 1x 128-bit of floats loaded+converted per clock vs. the current 2x 256-bit: 4x slower. You can probably do `float` Kahan for cheaper, and like I said multiple accumulators is already usually good. – Peter Cordes Dec 27 '19 at 03:51