Improving the accuracy in SIMD summation

Question

I am investigating the performance of SIMD summation in terms of speed and accuracy. To conduct this analysis, let's consider a scenario where I have multiple 1D arrays of doubles combined into a large 1D array, stored contiguously in memory. My goal is to efficiently compute the sum of the 1D arrays of doubles in a vectorized manner using C++ language. However, I have concerns regarding the accuracy of the results. Could you please provide suggestions on how to enhance the accuracy of the computed results? Additionally, I would greatly appreciate any recommendations on how to measure the speedup achieved in this process.

#include <iostream>
#include <numeric> // Include the numeric header for sequential summation
#include <cmath>   // Include the cmath header for fabs
#include <vector>  // Include the vector header for dynamic arrays

#if defined(__AVX__)
#include <immintrin.h> // Include the appropriate SIMD header file for AVX
#elif defined(__SSE__)
#include <emmintrin.h> // Include the appropriate SIMD header file for SSE
#else
#error "Unsupported SIMD instruction set"
#endif

template <typename T>
class VectorizedSummation {
public:
    VectorizedSummation(const T* data, size_t numElements)
        : data_(data), numElements_(numElements) {}

    double computeSum() {
        size_t simdWidth = sizeof(__m256d) / sizeof(T); // Determine the SIMD width based on the target SIMD instruction set (e.g., AVX)

        // Initialize the accumulator with zeros
        __m256d sum = _mm256_setzero_pd();

        // Process the data in a vectorized manner
        for (size_t i = 0; i < numElements_; i += simdWidth)
        {
            // Load a SIMD vector from memory
            __m256d vec = _mm256_load_pd(reinterpret_cast<const double*>(&data_[i]));

            // Add the SIMD vector to the accumulator
            sum = _mm256_add_pd(sum, vec);
        }

        // Sum the SIMD accumulator into a scalar double
        std::vector<double> result(simdWidth);
        _mm256_storeu_pd(result.data(), sum);
        // Result is what I really want to return. 
        // computing totalSum is for testing purposes and is likely to be removed
        double totalSum = 0.0;
        for (size_t i = 0; i < simdWidth; ++i) {
            totalSum += result[i];
        }

        return totalSum;
    }

private:
    const T* data_;
    size_t numElements_;
};

int main()
{
    const size_t numElements = 1000000;
    double* doubleData = static_cast<double*>(aligned_alloc(32, numElements * sizeof(double))); // Aligned memory for SIMD operations
    float* floatData = static_cast<float*>(aligned_alloc(32, numElements * sizeof(float)));      // Aligned memory for SIMD operations

    // Initialize the double and float arrays with random values
    for (size_t i = 0; i < numElements; ++i)
    {
        doubleData[i] = static_cast<double>(rand()) / RAND_MAX;
        floatData[i] = static_cast<float>(rand()) / RAND_MAX;
    }

    // Sum the double array using vectorized summation
    VectorizedSummation<double> doubleSummation(doubleData, numElements);
    double doubleSum = doubleSummation.computeSum();

    // Compute the sequential sum of the double array
    double sequentialDoubleSum = std::accumulate(doubleData, doubleData + numElements, 0.0);

    // Verify the correctness of the double summation
    double tolerance = 1e-4;
    bool doubleSumCorrect = std::fabs(doubleSum - sequentialDoubleSum) < tolerance;
    std::cout << "Double Sum Verification: " << (doubleSumCorrect ? "Passed" : "Failed") << std::endl;

    // Clean up the allocated memory
    free(doubleData);

    // Sum the float array using vectorized summation
    VectorizedSummation<float> floatSummation(floatData, numElements);
    double floatSum = floatSummation.computeSum();

    // Compute the sequential sum of the float array
    double sequentialFloatSum = std::accumulate(floatData, floatData + numElements, 0.0);

    // Verify the correctness of the float summation
    bool floatSumCorrect = std::fabs(floatSum - sequentialFloatSum) < tolerance;
    std::cout << "Float Sum Verification: " << (floatSumCorrect ? "Passed" : "Failed") << std::endl;

    // Clean up the allocated memory
    free(floatData);

    return 0;
}

why do you have concerns? Do you get the wrong answer? How accurate do you need the result to be? — Alan Birtles, May 26 '23 at 17:22
Do the numbers have different signs? Especially then it makes sense to add in a tree-like manner instead of linearly not to loose precision. — Sebastian, May 26 '23 at 18:37
It's not as if `std::accumulate` is necessarily *good* either, even if you get a different result from the SIMD code it may actually be better — harold, May 26 '23 at 19:13
Classical way of reducing error is [Kahan summation algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm). — gudok, May 26 '23 at 19:14
re: the title question about precision: see [Simd matmul program gives different numerical results](https://stackoverflow.com/q/55477701) - that's normal. The SIMD sum is probably *closer* to the exact result (which you could get with Kahan summation or `long double` to compare different methods against). And see [Efficient stable sum of a sorted array in AVX2](https://stackoverflow.com/q/46119811) . A simple scalar loop like `sum += arr[i]` is usually the worst way to sum an array, so the vectorized sum having different rounding error is usually *less* error, closer to pairwise summation. — Peter Cordes, May 27 '23 at 03:16
Your edit turned it into a duplicate of your other question [Am I getting a reasonable speed-up for vectorized addition of the rows and columns of a 2D array?](https://stackoverflow.com/q/76344978) — Peter Cordes, May 27 '23 at 03:18

ALX23z · Answer 1 · 2023-05-26T19:35:52.387

The question is too general and based on the description and code you seem to lack knowledge of basic principles. You should try some numerical analysis courses; there should be some online or you could just read a book on the subject.

I am not sure what kind of accuracy problem you are facing with the SIMD summation. But the most definite wrong way to test it is by comparing it with sequential summation, as sequential summation is a less accurate method compared to vectorized summation.

The problem is that when adding two floating points, some data is truncated in the result. The bigger difference in magnitude there is between the two numbers the larger numerical error result will have. An extreme example is summing 2^32 floats all of which are equal to 1. Obviously the answer is 2^32 but I believe the computation results in 2^24 via direct summation while 4-vectorized summation will result in 2^26. Checkout definition of float so you can figure out why it happens.

I am sure you can find various algorithms online that improve numerical accuracy of floating points summation if that is what you are looking for. Or ask a more precise question once you better understand what you want.

I changed my question and the code that comes with it, hoping it allows for better understanding. — user3116936, May 27 '23 at 01:00
@user3116936 please don't do that after you've already received answers that would be insulated by your edits — Alan Birtles, May 27 '23 at 06:19

Improving the accuracy in SIMD summation

1 Answers1