4

I was messing around with SSE, trying to write a function that would add up all the values of a single-precision floating-point array. I wanted it to work for all lengths of the array, not just ones that are a multiple of 4, as assumed in virtually all examples on the web. I came up with something like this:

#include <stddef.h>
#include <pmmintrin.h> // SSE3 (for _mm_hadd_ps); pulls in the SSE headers

float sse_sum(const float *x, const size_t n)
{
    const size_t
        steps = n / 4,
        rem = n % 4,
        limit = steps * 4;

    __m128 
        v, // vector of current values of x
        sum = _mm_setzero_ps(); // sum accumulator (setzero takes no arguments)

    // perform the main part of the addition
    size_t i;
    for (i = 0; i < limit; i+=4)
    {
        v = _mm_load_ps(&x[i]);
        sum = _mm_add_ps(sum, v);
    }

    // add the last 1 - 3 odd items if necessary, based on the remainder value
    switch(rem)
    {
        case 0: 
            // nothing to do if no elements left
            break;
        case 1: 
            // put 1 remaining value into v, initialize remaining 3 to 0.0
            v = _mm_load_ss(&x[i]);
            sum = _mm_add_ps(sum, v);
            break;
        case 2: 
            // set all 4 to zero
            v = _mm_setzero_ps();
            // load remaining 2 values into lower part of v
            v = _mm_loadl_pi(v, (const __m64 *)(&x[i]));
            sum = _mm_add_ps(sum, v);
            break;
        case 3: 
            // put the last one of the remaining 3 values into v, initialize rest to 0.0
            v = _mm_load_ss(&x[i+2]);
            // copy lower part containing one 0.0 and the value into the higher part
            v = _mm_movelh_ps(v,v);
            // load remaining 2 of the 3 values into lower part, overwriting 
            // old contents                         
            v = _mm_loadl_pi(v, (const __m64*)(&x[i]));     
            sum = _mm_add_ps(sum, v);
            break;
    }

    // add up partial results
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    // store the final value in a float variable (_mm_store_ss imposes no alignment requirement)
    float ret;
    _mm_store_ss(&ret, sum);
    return ret; 
}

Then I started wondering if this wasn't overkill. I mean, I got stuck in SIMD mode and just had to have the tail processed with SSE as well. It was fun, but isn't it just as good (and simpler) to add up the tail and calculate the result using regular float operations? Am I gaining anything by doing it in SSE?
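For reference, the simpler scalar-tail variant I have in mind would look something like this (just a sketch; it uses unaligned loads and an SSE1-only shuffle-based horizontal sum, so the exact intrinsics differ slightly from the code above):

```c
#include <stddef.h>
#include <xmmintrin.h> // SSE intrinsics

// Sketch of the scalar-tail alternative: SSE for the bulk of the array,
// plain float adds for the last 0-3 elements.
static float sse_sum_scalar_tail(const float *x, size_t n)
{
    const size_t limit = n - n % 4;
    __m128 sum = _mm_setzero_ps();

    size_t i;
    for (i = 0; i < limit; i += 4)
        sum = _mm_add_ps(sum, _mm_loadu_ps(&x[i])); // loadu: no alignment assumed

    // horizontal sum of the 4 lanes via shuffles (no SSE3 needed)
    __m128 shuf = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(2, 3, 0, 1));
    sum = _mm_add_ps(sum, shuf);     // lanes now [a+b, a+b, c+d, c+d]
    shuf = _mm_movehl_ps(shuf, sum); // bring c+d down to lane 0
    sum = _mm_add_ss(sum, shuf);
    float ret = _mm_cvtss_f32(sum);

    // scalar tail: the remaining 0-3 elements
    for (; i < n; ++i)
        ret += x[i];
    return ret;
}
```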

neuviemeporte
  • There's a whole bunch of ways to do this - some uglier than others. But if the array is large enough, it doesn't really matter which you use (aside from readability) since the overhead should be small. – Mysticial Mar 19 '13 at 23:21
  • Well, look at it this way. What percentage of your array is in the tail? How much speed-up do you get moving from standard floating-point to SSE? Then apply [Amdahl's Law](http://en.wikipedia.org/wiki/Amdahl%27s_Law). Then balance that number against the fact that you've made your code significantly uglier ;) – Oliver Charlesworth Mar 19 '13 at 23:22
  • Note that a full implementation of this idea must deal with unaligned inputs (thus "remainders" at the beginning) as well. – Ben Jackson Mar 19 '13 at 23:24
  • Typically when you manually write SIMD code it's for performance gain, so you tell us: are you gaining anything? Did it run faster than processing the remainder without SIMD? Up to you to decide. – GManNickG Mar 19 '13 at 23:25
  • Since I'm operating on bunches of 4 floats at a time, the tail is always 3 elements long at most. I also assume aligned input because my data will come from buffers allocated by a library that aligns them (OpenCV pixel values). I will do some measurements and update on Q. – neuviemeporte Mar 19 '13 at 23:46
  • @Mystical: can you outline an example approach (preferably a less ugly one)? – neuviemeporte Mar 19 '13 at 23:48
  • Defaulting to a sequential loop is the cleanest way. Other (not so clean) methods include [Duff's Device](http://en.wikipedia.org/wiki/Duff%27s_device) or a binary reducing set of if-statements. – Mysticial Mar 19 '13 at 23:58
  • Unless the array is expected to be small and performance is absolutely critical, do whatever is cleanest, as Mystical suggests. It's a tiny constant factor, the definition of polishing the underside of the bannister. – Stephen Canon Mar 20 '13 at 00:22
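The "binary reducing set of if-statements" Mysticial mentions could be sketched like this (my illustration of the idea, not his code): bit 1 of the remainder says a pair of elements is left, bit 0 says a single one is.

```c
#include <stddef.h>
#include <xmmintrin.h>

// Sketch of a branch-reduced tail load: rem is n % 4 (0..3).
// If bit 1 of rem is set, two elements remain as a pair;
// if bit 0 is set, one more single element remains.
static __m128 load_tail(const float *x, size_t rem)
{
    __m128 v = _mm_setzero_ps();
    if (rem & 2) {
        v = _mm_loadl_pi(v, (const __m64 *)x); // pair into the low two lanes
        x += 2;
    }
    if (rem & 1) {
        __m128 s = _mm_load_ss(x);             // single into lane 0, rest zero
        v = _mm_movelh_ps(v, s);               // pair stays low, single goes high
    }
    return v; // add this to the vector accumulator as usual
}
```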

3 Answers


I would check out Agner Fog's vectorclass. See the section "When the data size is not a multiple of the vector size" of the VectorClass.pdf. He lists five different ways of doing this and discusses the pros and cons of each. http://www.agner.org/optimize/#vectorclass

In general, I handle this the way shown at the following link: http://fastcpp.blogspot.no/2011/04/how-to-unroll-loop-in-c.html

// round x down to a multiple of s (s must be a power of two)
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

void add(float* result, const float* a, const float* b, int N) {
    int i = 0;
    for (; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a4 = _mm_loadu_ps(a + i);
        __m128 b4 = _mm_loadu_ps(b + i);
        __m128 sum = _mm_add_ps(a4, b4);
        _mm_storeu_ps(result + i, sum);
    }
    for (; i < N; i++) {
        result[i] = a[i] + b[i];
    }
}
  • This isn't a reduction, so if the array is at least 4 elements, and `result` doesn't overlap one of the inputs, you can do a final unaligned vector that might partially overlap the previous vector. `_mm_loadu_ps(a + N-4)` ... `_mm_storeu_ps(result + N-4, final)`. That avoids any branching based on `N%4`. But the problem in the question is reduction, `sum += a[i]`, which means you have to avoid double-counting elements. – Peter Cordes Oct 30 '22 at 03:10
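The overlapping-final-vector idea from the comment above might look like this (an illustrative sketch, not Peter Cordes's code; it assumes N >= 4 and that result does not alias a or b):

```c
#include <stddef.h>
#include <xmmintrin.h>

// Branch-free tail for the non-reduction case: process full vectors,
// then redo a final, possibly overlapping, unaligned vector ending at N.
// Recomputing overlapped elements is harmless because result[i] depends
// only on a[i] and b[i]. Assumes N >= 4 and result doesn't alias a or b.
static void add_overlap_tail(float *result, const float *a, const float *b, int N)
{
    int i;
    for (i = 0; i + 4 <= N; i += 4)
        _mm_storeu_ps(result + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    if (i < N) // 1-3 leftovers: one last vector aligned to the end of the array
        _mm_storeu_ps(result + N - 4,
                      _mm_add_ps(_mm_loadu_ps(a + N - 4), _mm_loadu_ps(b + N - 4)));
}
```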

As promised, I did some benchmark tests. To this end, I _aligned_malloc'd a float array of size 100k, populated it with the single value of 1.123f and tested the functions against that. I wrote a naive summation function that just accumulated the result in a loop. Next, I made a simplified SSE summation function variant with the horizontal and tail additions done with regular floats:

float sseSimpleSum(const float *x, const size_t n)
{
    /* ... Everything as before, but instead of hadd: */

    // do the horizontal sum on "sum" (a __m128) by hand
    // (type-punning through a float pointer; works in practice on MSVC/gcc)
    const float *sumf = (const float*)(&sum);
    float ret = sumf[0] + sumf[1] + sumf[2] + sumf[3];

    // add up the tail
    for (; i < n; ++i)
    {
        ret += x[i];
    }

    return ret; 
}

I got no performance hit; it even seemed marginally faster at times, but I find the timer quite unreliable, so let's just assume the simplified variant is equivalent to the convoluted one. What was surprising, though, was the rather large difference between the values obtained from the SSE and naive float summation functions. I suspected this was due to error accumulation from rounding, so I wrote a function based on the Kahan algorithm, which gave the correct result, although it was way slower than the naive float addition. For the sake of completeness, I made an SSE Kahan-based function along these lines:

float SubsetTest::sseKahanSum(const float *x, const size_t n)
{
    /* ... init as before... */

    __m128 
        sum = _mm_setzero_ps(), // sum accumulator
        c   = _mm_setzero_ps(), // correction accumulator
        y, t;

    // perform the main part of the addition
    size_t i;
    for (i = 0; i < limit; i+=4)
    {
        y = _mm_sub_ps(_mm_load_ps(&x[i]), c);
        t = _mm_add_ps(sum, y);
        c = _mm_sub_ps(_mm_sub_ps(t, sum), y);
        sum = t;
    }

    /* ... horizontal and tail sum as before... */
}
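The scalar Kahan function mentioned above follows the textbook algorithm; a sketch (my benchmarked version may have differed in minor details):

```c
#include <stddef.h>

// Textbook Kahan (compensated) summation. Note: must be compiled without
// -ffast-math (or /fp:fast), which would let the compiler cancel the
// compensation term algebraically and turn this back into a naive sum.
static float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;                 // running compensation (lost low-order bits)
    for (size_t i = 0; i < n; ++i) {
        float y = x[i] - c;         // corrected next term
        float t = sum + y;          // big + small: low bits of y are lost here
        c = (t - sum) - y;          // algebraically zero; recovers the lost bits
        sum = t;
    }
    return sum;
}
```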

Here are the benchmark results, obtained from VC++ 2010 in Release mode. They show the obtained value of the sum, the time it took to compute, and the amount of error relative to the correct value:

Kahan: value = 112300, time = 1155, error = 0
Float: value = 112328.78125, time = 323, error = 28.78125
SSE: value = 112304.476563, time = 46, error = 4.4765625
Simple SSE: value = 112304.476563, time = 45, error = 4.4765625
Kahan SSE: value = 112300, time = 167, error = 0

The amount of error on the naive float addition is huge! I suspect the non-Kahan SSE functions are more accurate because they amount to a partial pairwise summation, which can yield accuracy improvements over the straightforward approach. The Kahan SSE is accurate, but only about twice as fast as the naive float addition.
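Pairwise summation itself is easy to sketch in scalar code (recursive halving, with a plain loop as the base case); its rounding error grows roughly with log n instead of n:

```c
#include <stddef.h>

// Sketch of scalar pairwise summation: split the array recursively and
// sum the halves, so partial sums stay of similar magnitude.
static float pairwise_sum(const float *x, size_t n)
{
    if (n <= 8) {                   // small base case: plain loop
        float s = 0.0f;
        for (size_t i = 0; i < n; ++i)
            s += x[i];
        return s;
    }
    size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}
```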

neuviemeporte
  • Kahan summation lengthens the loop-carried dependency chain and makes it even more valuable to unroll with multiple accumulators. Like `sum1` and `sum2`, `t1` and `t2`, etc. Or 4 sets. Then add them at the end. But the same idea would help for the normal SSE loop, as in [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/a/45114487) (FP dot product, where what you're adding is a multiply result, but taking an array value directly isn't different in terms of dep chains). – Peter Cordes Oct 30 '22 at 03:23
  • For example, Skylake has `addps` with 4 cycle latency, 2/clock throughput, so 8 vector adds can be in flight at once. So you need to unroll with at least 8 accumulators to get that factor of 8 speedup (if your data is hot in L1d cache or maybe L2, otherwise you'll bottleneck on bandwidth). This, more than cleanup, is where the real speedup is for arrays that aren't tiny. But a big unroll makes cleanup more important, like you might wind down to a loop that goes 1 vector at a time before going scalar. – Peter Cordes Oct 30 '22 at 03:27
  • And yes, even one SIMD vector gives you 4 accumulators, a step in the direction of pairwise summation. [Efficient stable sum of a sorted array in AVX2](https://stackoverflow.com/q/46119811) / [Simd matmul program gives different numerical results](https://stackoverflow.com/q/55477701) – Peter Cordes Oct 30 '22 at 03:39
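The multiple-accumulator unrolling suggested in the comments might be sketched like this (an illustration with two accumulators; real code might use four or eight, as the comments note):

```c
#include <stddef.h>
#include <xmmintrin.h>

// Sketch: two independent accumulators let two addps operations be in
// flight at once, partially hiding the add latency on the loop-carried
// dependency chain.
static float sse_sum_unrolled(const float *x, size_t n)
{
    __m128 sum0 = _mm_setzero_ps();
    __m128 sum1 = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        sum0 = _mm_add_ps(sum0, _mm_loadu_ps(x + i));     // chain 0
        sum1 = _mm_add_ps(sum1, _mm_loadu_ps(x + i + 4)); // chain 1, independent
    }
    __m128 sum = _mm_add_ps(sum0, sum1); // combine the chains once at the end

    // SSE1 horizontal sum via shuffles
    __m128 shuf = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(2, 3, 0, 1));
    sum = _mm_add_ps(sum, shuf);
    shuf = _mm_movehl_ps(shuf, sum);
    sum = _mm_add_ss(sum, shuf);
    float ret = _mm_cvtss_f32(sum);

    for (; i < n; ++i)  // scalar tail: up to 7 leftover elements
        ret += x[i];
    return ret;
}
```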

In this case it may be overkill unless you can point to some real performance gains. If you are using gcc, then this guide on auto-vectorization with gcc 4.7 may be a nice alternative; although it is clearly gcc-specific, it is not as ugly as intrinsics.

Shafik Yaghmour