
Suppose I have some very simple code like:

double array[SIZE_OF_ARRAY];
double sum = 0.0;

for (int i = 0; i < SIZE_OF_ARRAY; ++i)
{
    sum += array[i];
}

I basically want to do the same operations using SSE2. How can I do that?

Peter Lee
  • If you really have to use double precision then it's probably not worth bothering, since most modern x86 CPUs have two FPUs these days. If you can drop down to single precision (i.e. float) then it might be worth doing. How much performance improvement do you need? – Paul R Oct 01 '12 at 22:16
  • Highly recommend using Kahan summation (a minimal scalar sketch follows these comments). The solutions presented in the question and answer here are vulnerable to errors. – Cory Nelson Dec 28 '16 at 01:06
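
For reference, here's a minimal scalar sketch of Kahan (compensated) summation. It is not part of the original post, and kahan_sum is just an illustrative name; the same compensation idea can be applied per lane with the intrinsics in the answer below:

double kahan_sum(const double *a, int n)
{
    double sum = 0.0;
    double c = 0.0;               /* running compensation for lost low-order bits */
    for (int i = 0; i < n; ++i)
    {
        double y = a[i] - c;      /* apply the correction from the previous step */
        double t = sum + y;       /* low-order bits of y may be lost in this add */
        c = (t - sum) - y;        /* recover what was just lost */
        sum = t;
    }
    return sum;                   /* compile without -ffast-math, or the compensation may be optimised away */
}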

1 Answer


Here's a very simple SSE3 implementation:

#include <pmmintrin.h>   // SSE3, for _mm_hadd_pd (also pulls in the SSE2 intrinsics from emmintrin.h)

__m128d vsum = _mm_set1_pd(0.0);
for (int i = 0; i < n; i += 2)
{
    __m128d v = _mm_load_pd(&a[i]);
    vsum = _mm_add_pd(vsum, v);
}
vsum = _mm_hadd_pd(vsum, vsum);      // horizontal add of the two lanes (SSE3)
double sum = _mm_cvtsd_f64(vsum);    // extract the low double

You can unroll the loop to get much better performance by using multiple accumulators to hide the latency of FP addition (as suggested by @Mysticial). Unroll 3 or 4 times with multiple "sum" vectors to bottleneck on load and FP-add throughput (one or two per clock cycle) instead of FP-add latency (one per 3 or 4 cycles):

__m128d vsum0 = _mm_setzero_pd();
__m128d vsum1 = _mm_setzero_pd();
for (int i = 0; i < n; i += 4)
{
    __m128d v0 = _mm_load_pd(&a[i]);
    __m128d v1 = _mm_load_pd(&a[i + 2]);
    vsum0 = _mm_add_pd(vsum0, v0);
    vsum1 = _mm_add_pd(vsum1, v1);
}
vsum0 = _mm_add_pd(vsum0, vsum1);    // vertical ops down to one accumulator
vsum0 = _mm_hadd_pd(vsum0, vsum0);   // horizontal add of the single register
double sum = _mm_cvtsd_f64(vsum0);

Note that the array a is assumed to be 16-byte aligned and the number of elements n is assumed to be a multiple of 2 (or 4, in the case of the unrolled loop).
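
If those assumptions don't hold, one possible variation (a sketch, not from the original answer) is to use unaligned loads and mop up any leftover element with a scalar loop:

__m128d vsum = _mm_setzero_pd();
int i = 0;
for (; i <= n - 2; i += 2)
{
    __m128d v = _mm_loadu_pd(&a[i]);    // unaligned load: no 16-byte alignment required
    vsum = _mm_add_pd(vsum, v);
}
vsum = _mm_hadd_pd(vsum, vsum);         // horizontal add (SSE3), as above
double sum = _mm_cvtsd_f64(vsum);
for (; i < n; ++i)                      // scalar clean-up if n is odd
    sum += a[i];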

See also Fastest way to do horizontal float vector sum on x86 for alternate ways of doing the horizontal sum outside the loop. SSE3 support is not totally universal (AMD CPUs in particular gained support for it later than Intel).

Also, _mm_hadd_pd is usually not the fastest way even on CPUs that support it, so an SSE2-only version won't be worse on modern CPUs. It's outside the loop and doesn't make much difference either way, though.
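
For example, one SSE2-only way to do the final horizontal add (a sketch, reusing the vsum0 accumulator from the unrolled loop above) is:

__m128d hi64 = _mm_unpackhi_pd(vsum0, vsum0);        // copy the high lane into the low lane
double sum = _mm_cvtsd_f64(_mm_add_sd(vsum0, hi64)); // low lane = vsum0[0] + vsum0[1]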

Paul R
  • I think this can benefit from unrolling at least 3 iterations. (3 separate `vsum` variables) – Mysticial Oct 01 '12 at 22:26
  • Yes, probably. You can let the compiler unroll it or perhaps do a better job by hand. Performance will probably be limited by memory bandwidth though, unless it's a relatively small data set that happens to be in cache, so micro-optimisations may not yield much benefit. – Paul R Oct 01 '12 at 22:30
  • I don't think the compiler is allowed to node-split since it breaks associativity. That said I haven't seen what it will do under relaxed floating-point. But I've never seen a compiler get too aggressive with optimizing SSE intrinsics. – Mysticial Oct 01 '12 at 22:32
  • Yes, compiler loop unrolling will probably just generate multiple loads and adds without introducing additional temporaries, so you get some benefits (e.g. from dual issue of loads), but not as much as from intelligent manual unrolling. I've added a 2x unrolled version to the answer now. – Paul R Oct 01 '12 at 22:35
  • 1
    `_mm_hadd_pd` is not an SSE2 intrinsic btw, it's SSE3. – harold Oct 02 '12 at 17:45
  • @harold: thanks for pointing that out - I'm assuming SSE3 is acceptable, since it's been around since 2004, but if the OP needs to support really old CPUs then I can probably come up with a workaround. I've updated the question anyway to make sure this is clear. – Paul R Oct 02 '12 at 17:55
  • I tidied this up. I suspect you would have said something more like this if you'd written this answer now instead of 4 years ago, but please edit if I got too technical in the multiple-accumulators paragraph. – Peter Cordes Dec 28 '16 at 00:53
  • @PeterCordes: thanks for the improvements Peter - much appreciated. Double precision is not a domain I normally work in, so my choice of intrinsics may well have been a bit amateur-ish. – Paul R Dec 28 '16 at 07:10