
I was trying to optimize the following code (a sum of squared differences of two arrays):

inline float Square(float value)
{
    return value*value;
}

float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
    float sum = 0;
    for(size_t i = 0; i < size; ++i)
        sum += Square(a[i] - b[i]);
    return sum;
}

So I optimized it using the CPU's SSE instructions:

inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum)
{
    __m128 _a = _mm_loadu_ps(a + i);
    __m128 _b = _mm_loadu_ps(b + i);
    __m128 _d = _mm_sub_ps(_a, _b);
    sum = _mm_add_ps(sum, _mm_mul_ps(_d, _d));
}

inline float ExtractSum(__m128 a)
{
    float _a[4];
    _mm_storeu_ps(_a, a);
    return _a[0] + _a[1] + _a[2] + _a[3];
}

float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
    size_t i = 0, alignedSize = size/4*4;
    __m128 sums = _mm_setzero_ps();
    for(; i < alignedSize; i += 4)
        SquaredDifferenceSum(a, b, i, sums);
    float sum = ExtractSum(sums);
    for(; i < size; ++i)
        sum += Square(a[i] - b[i]);
    return sum;
}

This code works fine if the arrays are not too large. But if they are big enough, there is a large computing error between the result of the base function and that of its optimized version. So my question is: where is the bug in the SSE-optimized code that leads to this computing error?

  • Can you post the generated assembly code as well? Spontaneously I'd say that the SSE version, using multiple accumulator chains, ought to be slightly more accurate; however, depending on the switches, a compiler may well end up generating traditional x87 code in full 80-bit precision for the first loop without ever truncating along the way. In any event a recursive divide-and-conquer approach might be preferable to avoid long addition chains and unpredictable optimizer results if you're worried about precision. – doynax Aug 19 '15 at 14:36
  • I use the -msse compiler option, so I think that x87 instructions won't be used. –  Aug 19 '15 at 15:06
  • By the way, you have a loop-carried dependency chain on addps with a latency of 3c, but the reciprocal throughput of 2 loads, 1 sub, 1 add, 1 mul and some trivial loop overhead adds up to 2c on Haswell, so it should benefit from unrolling by 2 and using 2 accumulators. – harold Aug 19 '15 at 15:46
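
A minimal sketch of the recursive divide-and-conquer (pairwise) summation doynax mentions; the halves accumulate partial sums of similar magnitude, which keeps the rounding error of each addition small (the function name is illustrative):

float SquaredDifferenceSumPairwise(const float * a, const float * b, size_t size)
{
    if(size <= 16) // small base case: a plain loop is fine here
    {
        float sum = 0;
        for(size_t i = 0; i < size; ++i)
            sum += Square(a[i] - b[i]);
        return sum;
    }
    size_t half = size/2; // split so both partial sums stay comparable in magnitude
    return SquaredDifferenceSumPairwise(a, b, half)
         + SquaredDifferenceSumPairwise(a + half, b + half, size - half);
}

And a minimal sketch of the two-accumulator unrolling harold suggests, assuming the question's four-argument SquaredDifferenceSum helper and ExtractSum are in scope; the two independent accumulators let the addps operations overlap instead of forming one long dependency chain:

float SquaredDifferenceSumUnrolled(const float * a, const float * b, size_t size)
{
    size_t i = 0, alignedSize = size/8*8;
    __m128 sums0 = _mm_setzero_ps(), sums1 = _mm_setzero_ps();
    for(; i < alignedSize; i += 8)
    {
        SquaredDifferenceSum(a, b, i + 0, sums0); // chain 0
        SquaredDifferenceSum(a, b, i + 4, sums1); // chain 1, independent of chain 0
    }
    float sum = ExtractSum(_mm_add_ps(sums0, sums1));
    for(; i < size; ++i)
        sum += Square(a[i] - b[i]);
    return sum;
}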

1 Answer


The error follows from the finite precision of floating point numbers. Each addition of two floating point numbers has a rounding error proportional to the magnitude of the result, so when a small term is added to a much larger sum, most of the term's low-order bits are simply lost. In your scalar version of the algorithm the running sum becomes much greater than each individual term (if the arrays are big enough, of course), which leads to the accumulation of a large computing error.
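
For example (assuming IEEE 754 single precision), once the running sum reaches 2^24, adding 1.0f no longer changes it at all:

#include <cstdio>

int main()
{
    float sum = 16777216.0f; // 2^24
    sum += 1.0f; // 16777217 is not representable as a float, so this rounds back to 2^24
    printf("%.1f\n", sum); // prints 16777216.0
    return 0;
}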

In the SSE version of the algorithm there are actually four partial sums accumulating the result. Each of them is roughly four times smaller than the single sum in the scalar code, so the gap in magnitude between the sum and each added term is four times smaller, and this leads to a smaller computing error.

There are two ways to reduce this error:

1) Use double precision floating point numbers for the accumulating sum (see the sketch below).

2) Use the Kahan summation algorithm (also known as compensated summation), which significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach.

https://en.wikipedia.org/wiki/Kahan_summation_algorithm
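
A minimal sketch of option 1: the terms stay float, only the accumulator is double (the vector variant assumes SSE2 is available for the double arithmetic; the function names are illustrative):

float SquaredDifferenceSumDouble(const float * a, const float * b, size_t size)
{
    double sum = 0; // a double keeps ~53 significant bits instead of ~24
    for(size_t i = 0; i < size; ++i)
        sum += Square(a[i] - b[i]);
    return (float)sum;
}

inline void SquaredDifferenceSumDouble(const float * a, const float * b, size_t i,
                                       __m128d & sum0, __m128d & sum1)
{
    __m128 _a = _mm_loadu_ps(a + i);
    __m128 _b = _mm_loadu_ps(b + i);
    __m128 _d = _mm_sub_ps(_a, _b);
    __m128 _s = _mm_mul_ps(_d, _d);
    sum0 = _mm_add_pd(sum0, _mm_cvtps_pd(_s)); // widen the low two squares to double
    sum1 = _mm_add_pd(sum1, _mm_cvtps_pd(_mm_movehl_ps(_s, _s))); // and the high two
}

The outer loop is the same as in your SSE version; at the end the two __m128d accumulators are added together and reduced to a scalar.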

Using the Kahan summation algorithm, your scalar code will look like this:

inline void KahanSum(float value, float & sum, float & correction)
{
    float term = value - correction;  // apply the error compensated on the previous step
    float temp = sum + term;          // the low-order bits of term are lost here
    correction = (temp - sum) - term; // algebraically zero; numerically it recovers the lost part
    sum = temp;
}

float SquaredDifferenceKahanSum(const float * a, const float * b, size_t size)
{
    float sum = 0, correction = 0;
    for(size_t i = 0; i < size; ++i)
        KahanSum(Square(a[i] - b[i]), sum, correction);
    return sum;
}

And the SSE-optimized code will look as follows:

inline void SquaredDifferenceKahanSum(const float * a, const float * b, size_t i, 
                                      __m128 & sum, __m128 & correction)
{
    __m128 _a = _mm_loadu_ps(a + i);
    __m128 _b = _mm_loadu_ps(b + i);
    __m128 _d = _mm_sub_ps(_a, _b);
    __m128 term = _mm_sub_ps(_mm_mul_ps(_d, _d), correction); // the same steps as the scalar
    __m128 temp = _mm_add_ps(sum, term);                      // KahanSum, on four lanes at once
    correction = _mm_sub_ps(_mm_sub_ps(temp, sum), term);
    sum = temp;
}

float SquaredDifferenceKahanSum(const float * a, const float * b, size_t size)
{
    size_t i = 0, alignedSize = size/4*4;
    __m128 sums = _mm_setzero_ps(), corrections = _mm_setzero_ps();
    for(; i < alignedSize; i += 4)
        SquaredDifferenceKahanSum(a, b, i, sums, corrections);
    float sum = ExtractSum(sums), correction = 0;
    for(; i < size; ++i)
        KahanSum(Square(a[i] - b[i]), sum, correction);
    return sum;
}
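
A hypothetical driver for checking the difference (the data and sizes are illustrative; a double-precision accumulation serves as the reference, and it assumes the SSE variants above are the ones compiled in, since the scalar versions share their signatures):

#include <cstdio>
#include <vector>

int main()
{
    const size_t size = 10000000;
    std::vector<float> a(size), b(size);
    for(size_t i = 0; i < size; ++i)
    {
        a[i] = float(i%1000)*0.1f;
        b[i] = float(i%999)*0.1f;
    }
    double reference = 0; // double accumulation as the "ground truth"
    for(size_t i = 0; i < size; ++i)
        reference += double(Square(a[i] - b[i]));
    printf("reference: %.3f\n", reference);
    printf("plain sse: %.3f\n", SquaredDifferenceSum(a.data(), b.data(), size));
    printf("kahan sse: %.3f\n", SquaredDifferenceKahanSum(a.data(), b.data(), size));
    return 0;
}

Note that Kahan summation relies on each operation being rounded exactly as written: aggressive floating point optimization flags such as -ffast-math can let the compiler simplify (temp - sum) - term to zero and silently remove the correction.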
ErmIg