I was messing around with SSE, trying to write a function that would add up all the values of a single-precision floating-point array. I wanted it to work for all lengths of the array, not just ones that are a multiple of 4, as assumed in virtually all examples on the web. I came up with something like this:
float sse_sum(const float *x, const size_t n)
{
const size_t
steps = n / 4,
rem = n % 4,
limit = steps * 4;
__m128
v, // vector of current values of x
sum = _mm_setzero_ps(0.0f); // sum accumulator
// perform the main part of the addition
size_t i;
for (i = 0; i < limit; i+=4)
{
v = _mm_load_ps(&x[i]);
sum = _mm_add_ps(sum, v);
}
// add the last 1 - 3 odd items if necessary, based on the remainder value
switch(rem)
{
case 0:
// nothing to do if no elements left
break;
case 1:
// put 1 remaining value into v, initialize remaining 3 to 0.0
v = _mm_load_ss(&x[i]);
sum = _mm_add_ps(sum, v);
break;
case 2:
// set all 4 to zero
v = _mm_setzero_ps();
// load remaining 2 values into lower part of v
v = _mm_loadl_pi(v, (const __m64 *)(&x[i]));
sum = _mm_add_ps(sum, v);
break;
case 3:
// put the last one of the remaining 3 values into v, initialize rest to 0.0
v = _mm_load_ss(&x[i+2]);
// copy lower part containing one 0.0 and the value into the higher part
v = _mm_movelh_ps(v,v);
// load remaining 2 of the 3 values into lower part, overwriting
// old contents
v = _mm_loadl_pi(v, (const __m64*)(&x[i]));
sum = _mm_add_ps(sum, v);
break;
}
// add up partial results
sum = _mm_hadd_ps(sum, sum);
sum = _mm_hadd_ps(sum, sum);
__declspec(align(16)) float ret;
/// and store the final value in a float variable
_mm_store_ss(&ret, sum);
return ret;
}
Then I started wondering if this wasn't overkill. I mean, I got stuck in SIMD mode and just had to have the tail processed with SSE as well. It was fun, but isn't it just as good (and simpler) to add up the tail and calculate the result using regular float operations? Am I gaining anything by doing it in SSE?