Difference in array sum between scalar code and SSE code

Question

I am implementing array sum reduction. The result of my SSE code does not match that of the scalar code.

Am I using SSE intrinsics wrongly? Am I misunderstanding some fundamental concepts of SIMD programing?

Thanks in advance for any help!

gcc -Wall -Wextra -pedantic -std=c11 -march=native -o sum sumSSE.c

//sumSSE.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <assert.h>

#include <immintrin.h>

//Scalar sum
float sc_sum(float *vec, uint32_t n)
{
    float sum = 0.0;
    for (uint32_t i = 0; i < n; i++)
        sum += vec[i];
    return sum;
}
//SSE sum
float li_sum(float *vec, uint32_t n)
{
    float sum = 0.0;
    float t[4];
    uint32_t i, n4 = n>>2<<2;
    __m128 v1;
    __m128 v2 = _mm_setzero_ps();
    for (i = 0; i < n4; i+=4) {
        v1 = _mm_loadu_ps(&vec[i]);
        v2 = _mm_add_ps(v2, v1);
    }
    for (; i < n; i++) sum += vec[i];
    _mm_storeu_ps(t, v2);
    sum += t[0] + t[1] + t[2] + t[3];
    return sum;
}

int main(int argc, char **argv)
{
    srand(time(NULL));
    if (argc < 2) return -1;

    uint32_t n = strtol(argv[1], NULL, 10);
    float *vec = malloc(n * sizeof(float));
    if (!vec) return -1;
    //Fill array with random numbers in range 0-1
    for (uint32_t i = 0; i < n; i++)
        vec[i] = (float)rand()/RAND_MAX;

    float r1 = sc_sum(vec, n);
    fprintf(stderr, "R1: %.5f\n", r1);

    float r2 = li_sum(vec, n);
    fprintf(stderr, "R2: %.5f\n", r2);
    
    fprintf(stdout, "r1-r2 == %.8f\n", r1-r2);
    free(vec);
    assert(r1 == r2);
}

Here is the absolute value of the difference between the scalar result and the SSE implementation.

Floating point addition is not associative -- there is a difference between `(...((v0+v1)+v2)+v3)+...)+vn` vs `(v0+v4+...)+(v1+v5+...)+...` — chtz, Mar 09 '23 at 11:17
Did you check which one was closer to the correct sum, without rounding to `float` at each step (e.g. use `double` or `long double`)? It's probably the SIMD version that was closer. — Peter Cordes, Mar 09 '23 at 11:50
@SimonGoater It should (essentially) make no difference here, since this is the only assignment to `sum` except for the initialization `sum = 0.0`. And `0.0 + x === x`, unless `x === -0.0`. — chtz, Mar 09 '23 at 12:48
@chtz There's assignment two lines above to add the last 1 to 3 elements of the array. — Simon Goater, Mar 09 '23 at 12:52
@SimonGoater Yes, you are right (I was glancing too superficially over the code) — chtz, Mar 09 '23 at 12:57
@SimonGoater nope, it definitely has to do with floating point addition not being associative. Working on changing the code to see if the problem can be circumvented. — jregalad, Mar 09 '23 at 13:13
The error is much smaller now though. Out of interest, I just tried modifying your scalar function to add up like the SSE one and it appears to give the same result, finishing successfully. — Simon Goater, Mar 09 '23 at 13:57
@jregalad: floating point is always a tradeoff between speed and precision; you can do something like Kahan summation or widening to `double` if you want to mostly sidestep it. If you want a scalar fallback that gives the same result as a manually-vectorized version, may use 8 scalar accumulators `float sum0=0, sum1=0, sum2=0, ...` to match the elements in `__m128 vsum0, vsum1`. (Multiple accumulators to hide some FP latency; better would be at least 8 vectors, if we're talking about large-ish arrays, but small enough that they're hot in L1d cache.) — Peter Cordes, Mar 09 '23 at 20:08

Difference in array sum between scalar code and SSE code

0 Answers0