Is it faster to calculate the dot product of two vectors by means of the `dpps` instruction from the SSE 4.1 instruction set, or by using a series of `addps`, `shufps` and `mulps` instructions from SSE 1?

- Related: http://stackoverflow.com/q/18499971/1207195 but nothing is better than a good old performance test... – Adriano Repetti Jun 17 '16 at 10:53
- Is there more context to this? Often the entire situation of having to do a horizontal dot product in the first place can be avoided – harold Jun 17 '16 at 10:54
- How? Could you give an example? – Philipp Neufeld Jun 17 '16 at 10:55
- For example, if you're calculating the dot product between two larger vectors, that shouldn't be built up from dotting small parts and adding them (it can be, but that's a waste of horizontal ops). Or if you're fundamentally doing tiny dot products, it's almost always better to not ever have that vector in a vector register, but instead use a vector with only x-coords, a vector with only y-coords, etc. (see the sketch below these comments). But maybe neither of these apply... it depends – harold Jun 17 '16 at 11:13
- Why don't you just benchmark it? As you presumably have to implement one version, trying the other won't be a big deal. – Jun 17 '16 at 11:52
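
As an illustration of harold's comment about keeping coordinates in separate registers, here is a minimal sketch (the function name and data layout are hypothetical) that computes four 3-component dot products at once using only vertical SSE 1 operations, with no shuffles at all:

```cpp
#include <xmmintrin.h>  // SSE1

// Four 3-component dot products at once: inputs are stored as
// separate x/y/z arrays (structure-of-arrays) instead of packed
// xyzw vectors (array-of-structures).
__m128 dot3_soa(__m128 ax, __m128 ay, __m128 az,
                __m128 bx, __m128 by, __m128 bz)
{
    __m128 r = _mm_mul_ps(ax, bx);             // x*x for all four pairs
    r = _mm_add_ps(r, _mm_mul_ps(ay, by));     // + y*y
    return _mm_add_ps(r, _mm_mul_ps(az, bz));  // + z*z -> four results, no horizontal ops
}
```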
1 Answer
The answer is likely to be very contextual, depending on exactly where and how it's used in the larger code flow, as well as on exactly what hardware you are using.
Historically, when Intel has introduced new instructions, they've not dedicated much hardware area to them. If an instruction gets adopted and used enough, they put more hardware behind it in future generations. So `_mm_dp_ps` on Penryn wasn't particularly impressive compared to doing it the SSE2 way in terms of raw ALU performance. On the other hand, it does require fewer instructions in the I-cache, so it could potentially help where a more compact encoding performs better.
The real problem with `_mm_dp_ps` is that, as part of SSE 4.1, you can't count on it being supported on even every modern PC (Valve's Steam Hardware Survey pegs it at about 85% for gamers). Therefore, you end up having to write guarded code-paths rather than straight-line code, and that usually costs more than the benefits you get from using the instruction.
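
For illustration, a minimal sketch of such a guarded dispatch, assuming GCC/Clang and `__builtin_cpu_supports` (MSVC would use `__cpuid` instead; the function names here are hypothetical):

```cpp
#include <immintrin.h>

// SSE 4.1 path: needs the target attribute (or a separate TU built
// with -msse4.1) so the compiler will accept _mm_dp_ps.
__attribute__((target("sse4.1")))
float dot4_sse41(__m128 a, __m128 b)
{
    // 0xf1: multiply all four lanes, store the sum in lane 0
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xf1));
}

// Fallback path: SSE1-only multiply, shuffle, and add.
float dot4_sse2(__m128 a, __m128 b)
{
    __m128 t = _mm_mul_ps(a, b);
    t = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(2, 3, 0, 1))); // swap within pairs
    t = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2))); // swap halves
    return _mm_cvtss_f32(t);
}

float dot4(__m128 a, __m128 b)
{
    // The runtime guard is the cost: an extra check and branch on every call
    // (or the complexity of caching the result in a function pointer).
    if (__builtin_cpu_supports("sse4.1"))
        return dot4_sse41(a, b);
    return dot4_sse2(a, b);
}
```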
Where it is useful is if you are making a binary for a CPU that's guaranteed to support it. For example, if you are building with `/arch:AVX` (or even `/arch:AVX2`), either because you are targeting a fixed platform like the Xbox One or because you are building multiple versions of your EXE/DLL, you can assume SSE 4.1 will be supported as well.
This is effectively what DirectXMath does:
```cpp
inline XMVECTOR XMVector4Dot( FXMVECTOR V1, FXMVECTOR V2 )
{
#if defined(_XM_NO_INTRINSICS_)
    XMVECTOR Result;
    Result.vector4_f32[0] =
    Result.vector4_f32[1] =
    Result.vector4_f32[2] =
    Result.vector4_f32[3] = V1.vector4_f32[0] * V2.vector4_f32[0] +
                            V1.vector4_f32[1] * V2.vector4_f32[1] +
                            V1.vector4_f32[2] * V2.vector4_f32[2] +
                            V1.vector4_f32[3] * V2.vector4_f32[3];
    return Result;
#elif defined(_M_ARM) || defined(_M_ARM64)
    // NEON: multiply, then pairwise-add down to a single sum in both halves
    float32x4_t vTemp = vmulq_f32( V1, V2 );
    float32x2_t v1 = vget_low_f32( vTemp );
    float32x2_t v2 = vget_high_f32( vTemp );
    v1 = vpadd_f32( v1, v1 );
    v2 = vpadd_f32( v2, v2 );
    v1 = vadd_f32( v1, v2 );
    return vcombine_f32( v1, v1 );
#elif defined(__AVX__) || defined(__AVX2__)
    // AVX implies SSE 4.1 support, so dpps is safe here
    return _mm_dp_ps( V1, V2, 0xff );
#elif defined(_M_IX86) || defined(_M_X64)
    // SSE/SSE2: multiply, then horizontal add via shuffles
    XMVECTOR vTemp2 = V2;
    XMVECTOR vTemp = _mm_mul_ps(V1,vTemp2);
    vTemp2 = _mm_shuffle_ps(vTemp2,vTemp,_MM_SHUFFLE(1,0,0,0)); // Copy X to the Z position and Y to the W position
    vTemp2 = _mm_add_ps(vTemp2,vTemp);                          // Add Z = X+Z; W = Y+W
    vTemp = _mm_shuffle_ps(vTemp,vTemp2,_MM_SHUFFLE(0,3,0,0));  // Copy W to the Z position
    vTemp = _mm_add_ps(vTemp,vTemp2);                           // Add Z and W together
    return _mm_shuffle_ps(vTemp,vTemp,_MM_SHUFFLE(2,2,2,2));    // Splat the dot product
#else
#error Unsupported platform
#endif
}
```
This of course assumes you are going to use the 'scalar' result of a dot-product in additional vector operations. By convention, DirectXMath returns such scalars 'splatted' across the return vector.
See DirectXMath: SSE4.1 and SSE4.2
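
If you do need the dot product as an actual `float`, a minimal sketch building on that splatting convention (the wrapper name is hypothetical):

```cpp
#include <DirectXMath.h>
using namespace DirectX;

float Dot4Scalar(FXMVECTOR a, FXMVECTOR b)
{
    // The result is splatted across all four lanes, so extracting
    // lane 0 via XMVectorGetX is enough.
    return XMVectorGetX(XMVector4Dot(a, b));
}
```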
UPDATE: While not quite as ubiquitous as SSE/SSE2 support, you could require SSE3 support for the cases where you aren't building with `/arch:AVX` or `/arch:AVX2` and try:
```cpp
inline XMVECTOR XMVector4Dot(FXMVECTOR V1, FXMVECTOR V2)
{
    XMVECTOR vTemp = _mm_mul_ps(V1,V2);
    vTemp = _mm_hadd_ps( vTemp, vTemp );  // (x+y, z+w, x+y, z+w)
    return _mm_hadd_ps( vTemp, vTemp );   // full sum splatted across all lanes
}
```
That said, it's not clear that `hadd` is much of a win in most cases, at least for dot products, over the SSE/SSE2 add-and-shuffle solution.

- `dpps` still takes 4 uops in the uop cache on Intel SnB-family CPUs, which is often a more precious resource than L1 I-cache. However, it does look like a win in uops over the fallback for that case. Unlike `haddps` for a horizontal sum, it does the whole thing in only 4 uops. – Peter Cordes Jun 17 '16 at 21:37
- `haddps` is SSE3, which is technically not required for x64 CPUs or Windows 8.1/Windows 10, although the Valve Survey shows it at 99%, so it's effectively supported everywhere for gamers. See [DirectXMath: SSE3 and SSSE3](https://blogs.msdn.microsoft.com/chuckw/2012/09/11/directxmath-sse3-and-ssse3/) – Chuck Walbourn Jun 18 '16 at 07:49
- Looks worse for everything except code-size. `haddps` is 3 uops, 5c latency (on Haswell). So after the `mulps`, 2x `haddps` is 4 shuffles and 2 adds. The SSE2 version is 3 shuffles and 2 adds, and 9c latency instead of 10 (not including the mul). – Peter Cordes Jun 18 '16 at 09:19
- BTW, I hadn't looked at it before, but that's a nice use of `_mm_shuffle_ps` with 2 different input variables, to avoid needing a `movaps` to save the old value for the next add. However, one of those shuffles should be a `movhlps`: faster on Pentium-M / Merom / K8 (where shuffles are slow). See http://stackoverflow.com/a/35270026/224132 for some optimized horizontal sums. Hmm, I think you could avoid the last broadcast-shuffle, at a cost of 1 or 2 extra `movaps`, by shuffling so all the elements of the last add produce the full result. – Peter Cordes Jun 18 '16 at 09:29
- Actually, I think it's always going to take two mov+shuffle, since `movshdup` doesn't swap, it just duplicates. If you care more about new CPUs (i.e. SSE2 baseline, but tuned for new CPUs that handle `movaps` with zero latency (IvyBridge+, and Bulldozer+)), it *might* be worth it to do `mulps` / `movaps`+`shufps` / `addps` / `movaps`+`shufps` / `addps`. (SSE2 `pshufd` is a mov-and-shuffle, but will cause bypass delays on many CPUs when used between `addps` instructions :/). This is a lot of total uops, which isn't great for a horizontal sum that shouldn't be inside inner loops anyway. – Peter Cordes Jun 18 '16 at 09:37
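
For reference, the `movshdup`/`movhlps` horizontal-sum pattern from the SO answer Peter links above looks roughly like this:

```cpp
#include <pmmintrin.h>  // SSE3

// Horizontal sum of all four lanes, returned as a scalar.
float hsum_ps_sse3(__m128 v)
{
    __m128 shuf = _mm_movehdup_ps(v);        // duplicate odd lanes: (y, y, w, w)
    __m128 sums = _mm_add_ps(v, shuf);       // (x+y, 2y, z+w, 2w)
    shuf        = _mm_movehl_ps(shuf, sums); // move high half down: lane 0 = z+w
    sums        = _mm_add_ss(sums, shuf);    // lane 0 = x+y+z+w
    return _mm_cvtss_f32(sums);
}
```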
- My theory is that it's usually best to optimize hsums for minimal impact on the uop cache, to avoid slowing down surrounding code. Better latency/throughput and code-size come after minimizing fused-domain uops. I have not tested this theory with any actual tuning to see what has the least impact on a loop that does a few "vertical" adds. – Peter Cordes Jun 18 '16 at 09:40
- Feel free to propose any changes to DirectXMath via a pull request on [GitHub](https://github.com/Microsoft/DirectXMath). I have a new DirectXMath 3.09 pending with the upcoming Windows 10 Anniversary SDK, and a 3.10 in the pipeline after that. – Chuck Walbourn Jun 18 '16 at 20:07
- I'm more interested in Agner Fog's vector class library, and [I'm only part way through making some improvements to it](http://www.agner.org/optimize/vectorclass/read.php?i=124). My conclusion for this DirectXMath function is that what you have is nearly optimal for everything except old SlowShuffle CPUs like Merom, and it's not worth implementing an SSE3 version. Feel free to copy my hsum code (using `movshdup`/`movhlps`) from that SO answer I linked, to save a couple bytes in the pre-SSE4.1 version and run faster on Merom. It's just a byte of code-size for newer CPUs with fast `shufps`, though. – Peter Cordes Jun 19 '16 at 04:18
- Agner's vector library is implementing all dot products with two `hadd` instructions when built for SSE3. You planning to change that? BTW, Agner's library is GNU GPL, DirectXMath is MIT, which doesn't matter for OSS projects but makes a hell of a difference for commercial products :) – Chuck Walbourn Jun 19 '16 at 04:27
- Yes, I've already fixed integer hsums to not use `phadd*` in my VCL git repo. And given limited time, I'd rather contribute to a GPL project. I've used GNU/Linux for nearly everything (except some games) since I've had my own computer (~20 years :). Anyway, feel free to use the same sequences of instructions for your project, or copy anything from my SO answers, but yeah, the VCL license does prevent you from directly copying whole chunks of code. I'd like to help out an MIT-licensed project, I just don't have time to do everything. – Peter Cordes Jun 19 '16 at 04:35