
I've written a function which operates on a matrix of 3D vectors. For each iterated point (the red cross in the diagram), it computes the difference with respect to the vector at the cell below (the blue crosses) and accumulates the dot product of that difference with itself in an output variable. To be precise, the vector at the iterated point, scaled by a constant, is subtracted from the vector at the cell below, and the value that is accumulated is the dot product of the result with itself, multiplied by another constant.
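
In pseudocode (using the names from the reproducer below), each cell (i1, i2) contributes:

    lambda = x(i1 + 1, i2) - k1 * x(i1, i2)
    accumulator += (k2 / 2) * dot(lambda, lambda)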

[diagram: iterated point (red cross) and the cells below (blue crosses); image not included]

The following is a minimal reproducer. It contains a naive variant which relies on auto-vectorization, and an optimized variant which uses AVX2 intrinsics and additionally performs loop blocking for better cache reuse.

#include <cmath>
#include <iostream>
#include <iomanip>
#include <memory>
#include <random>

#include <immintrin.h>


//+//////////////////////////////////////////////
// data structures
//+//////////////////////////////////////////////

class v3
{
    double data_[3] = {};
    
public:
    
    double& operator()(const std::size_t index)
    {
        return data_[index];
    }
    
    const double& operator()(const std::size_t index) const
    {
        return data_[index];
    }
    
    v3 operator-(const v3& op) const
    {
        auto ret_val = v3{};
        ret_val(0) = (*this)(0) - op(0);
        ret_val(1) = (*this)(1) - op(1);
        ret_val(2) = (*this)(2) - op(2);
        
        return ret_val;
    }
    
    v3 operator*(const double& op) const
    {
        auto ret_val = v3{};
        ret_val(0) = (*this)(0) * op;
        ret_val(1) = (*this)(1) * op;
        ret_val(2) = (*this)(2) * op;
        
        return ret_val;
    }
};

class matrix_v3
{
    std::size_t n1_;
    std::size_t n2_;
    std::unique_ptr<v3[]> positions_; // array form so that delete[] is used
    
public:
    
    matrix_v3(const std::size_t& n1, const std::size_t& n2)
        : n1_(n1), n2_(n2), positions_(new v3[n1 * n2])
    {
        if (n1_ < 1 || n2_ < 1)
        {
            throw std::invalid_argument("matrix_v3_ctor_invalid_sizes_error");
        }
    }
    
    std::size_t n1() const
    {
        return n1_;
    }
    
    std::size_t n2() const
    {
        return n2_;
    }
    
    v3& position(const std::size_t& i1, const std::size_t& i2)
    {
        if (n1_ <= i1 || n2_ <= i2)
        {
            throw std::invalid_argument("matrix_v3_position_invalid_indices_error");
        }
        return positions_.get()[i1 * n2_ + i2];
    }
    
    const v3& position(const std::size_t& i1, const std::size_t& i2) const
    {
        return const_cast<matrix_v3*>(this)->position(i1, i2);
    }
    
    double* stream(const std::size_t& i1)
    {
        if (n1_ <= i1)
        {
            throw std::invalid_argument("matrix_v3_stream_invalid_index_error");
        }
        return reinterpret_cast<double*>(positions_.get() + i1 * n2_);
    }
    
    const double* stream(const std::size_t& i1) const
    {
        return const_cast<matrix_v3*>(this)->stream(i1);
    }
};


//+//////////////////////////////////////////////
// functions
//+//////////////////////////////////////////////

double auto_vec_variant(const matrix_v3& mat, const double& k1, const double& k2)
{
    // output
    auto accumulator = double{};
    
    // loop
    const auto k2_b_2 = k2 * 0.5;
    for (auto i1 = std::size_t{}; i1 < mat.n1() - 1; ++i1)
    {
        for (auto i2 = std::size_t{}; i2 < mat.n2(); ++i2)
        {
            auto lambda = mat.position(i1 + 1, i2) - mat.position(i1, i2) * k1;
            accumulator += k2_b_2 * (lambda(0) * lambda(0) + lambda(1) * lambda(1) + lambda(2) * lambda(2));
        }
    }
    
    return accumulator;
}

// 9 MiB L3 cache
#define VVI_LLD_CACHE_SIZE_MB 9

double intrinsics_variant(const matrix_v3& mat, const double& k1, const double& k2)
{
    // inner loop constants
    static constexpr auto unroll_count = 3;
    static constexpr auto register_width = 4;
    static constexpr auto increment = register_width * unroll_count;
    
    // loop blocking details
    constexpr auto lld_cache_size = VVI_LLD_CACHE_SIZE_MB * 1024 * 1024 / (sizeof(double) * 4); // 4 comes from actual application
    const auto block_size = lld_cache_size / (mat.n1() - 1) / increment * increment; // must be divisible by increment!
    const auto block_size_in_v3 = block_size / 3;
    const auto rstream_size_div = mat.n2() * 3 / block_size * block_size;
    
    // packed constants
    const auto k2_b_2 = k2 * 0.5;
    const auto k1_p = _mm256_broadcast_sd(&k1);
    const auto k2_b_2_p = _mm256_broadcast_sd(&k2_b_2);
    
    // output
    auto accumulator = double{};
    
    // let's go!
    auto particle_index = std::size_t{};
    for (auto start_index = std::size_t{}; start_index < rstream_size_div; start_index += block_size,
        particle_index += block_size_in_v3)
    {
        for (auto i1 = std::size_t{}; i1 < mat.n1() - 1; ++i1)
        {
            // real valued streams
            const auto stream1 = mat.stream(i1);
            const auto stream2 = mat.stream(i1 + 1);
            
            // loop
            const auto end_index = start_index + block_size;
            for (auto index = start_index; index < end_index; index += increment)
            {
                // load x(i) and x(i)_r
                const auto x_0 = _mm256_loadu_pd(stream1 + index);
                const auto x_r_0 = _mm256_loadu_pd(stream2 + index);
                
                const auto x_1 = _mm256_loadu_pd(stream1 + index + 4);
                const auto x_r_1 = _mm256_loadu_pd(stream2 + index + 4);
                
                const auto x_2 = _mm256_loadu_pd(stream1 + index + 8);
                const auto x_r_2 = _mm256_loadu_pd(stream2 + index + 8);
                
                // subtraction and FMA (results in lambda)
                const auto l0_0 = _mm256_fnmadd_pd(x_0, k1_p, x_r_0);
                const auto l0_1 = _mm256_fnmadd_pd(x_1, k1_p, x_r_1);
                const auto l0_2 = _mm256_fnmadd_pd(x_2, k1_p, x_r_2);
                
                // shuffles x5
                const auto l1_0 = _mm256_shuffle_pd(l0_0, l0_1, 0b00001011);
                const auto l1_1 = _mm256_shuffle_pd(l0_1, l0_2, 0b00000010);
                const auto l2_0 = _mm256_shuffle_pd(l0_0, l1_1, 0b00000110);
                const auto y = _mm256_shuffle_pd(l1_0, l1_1, 0b00001100);
                const auto l2_2 = _mm256_shuffle_pd(l1_0, l0_2, 0b00001001);
                
                // rearrange (expensive ops)
                const auto x = _mm256_permute4x64_pd(l2_0, 0b01111000); // 1,3,2,0
                const auto z = _mm256_permute4x64_pd(l2_2, 0b11010010); // 3,1,0,2
                
                // arithmetics
                const auto xx = _mm256_mul_pd(x, x);
                const auto xx_p_yy = _mm256_fmadd_pd(y, y, xx);
                const auto dot_product = _mm256_fmadd_pd(z, z, xx_p_yy);
                
                // accumulate (FMA)
                auto pe = __m256d{};
                pe = _mm256_fmadd_pd(dot_product, k2_b_2_p, pe);
                
                // horizontal sum of packed dot product
                const auto pe_high = _mm256_extractf128_pd(pe, 1);
                const auto pe_low = _mm256_castpd256_pd128(pe);
                const auto sum_128 = _mm_add_pd(pe_low, pe_high);
                const auto sum_128_1 = _mm_shuffle_pd(sum_128, sum_128, 0b00000001); // 0, 1
                const auto sum_64 = _mm_add_pd(sum_128, sum_128_1);
                accumulator += _mm_cvtsd_f64(sum_64);
            }
        }
    }
    
    // outer remainder loop (no optimization necessary here)
    for (auto i1 = std::size_t{}; i1 < mat.n1() - 1; ++i1)
    {
        for (auto i2 = particle_index; i2 < mat.n2(); ++i2)
        {
            auto lambda = mat.position(i1 + 1, i2) - mat.position(i1, i2) * k1;
            accumulator += k2_b_2 * (lambda(0) * lambda(0) + lambda(1) * lambda(1) + lambda(2) * lambda(2));
        }
    }
    
    return accumulator;
}


//+//////////////////////////////////////////////
// tester
//+//////////////////////////////////////////////

bool test_functionality()
{
    const auto L = std::size_t{30};
    const auto N = std::size_t{999460};
    auto mat = matrix_v3(L, N);
    
    // init
    auto engine = std::mt19937_64(std::random_device()());
    auto dist = std::uniform_real_distribution<double>(-5.0, 5.0);
    for (auto i1 = std::size_t{}; i1 < mat.n1(); ++i1)
    {
        for (auto i2 = std::size_t{}; i2 < mat.n2(); ++i2)
        {
            auto& position = mat.position(i1, i2);
            position(0) = std::floor(dist(engine));
            position(1) = std::floor(dist(engine));
            position(2) = std::floor(dist(engine));
        }
    }
    
    // compute reference
    auto accumulator_ref = auto_vec_variant(mat, 0.1, 0.4);
    
    // compute test
    auto accumulator_test = intrinsics_variant(mat, 0.1, 0.4);
    
    // test
    const auto error = std::fabs(accumulator_ref - accumulator_test);
    if (error > 1E-15)
    {
        std::cout << std::fixed << std::setprecision(15) << "Test failed. Error: " << error << "." << std::endl;
        return false;
    }
    
    return true;
}


int main()
{
    // conduct the test
    if (test_functionality())
    {
        return 0;
    }
    
    return 1;
}

I'm compiling the above with GCC 11.1.0 and the flags "-O3;-mfma;-mavx2". The test fails with an error of ~0.000006318092346. As can be seen in the code, the use of std::floor ensures that only integer-valued floating-point numbers are involved in the computation.
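
For reference, the equivalent plain command line (assuming the reproducer is saved as main.cpp) would be along the lines of:

    g++ -O3 -mfma -mavx2 main.cpp -o repro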

Questions:

  1. Why is there such a large difference in the results of the two variants?
  2. Which variant should be more accurate?
  3. If the intrinsics-variant is less accurate, how can its accuracy be improved?

Additional information:

Using a simple counter, I counted how many additions are performed for the accumulation in each case: the naive variant performs 28984340 additions and the optimized one 21738255.
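
(For the naive variant this matches one addition per accumulated term, i.e. (L - 1) * N = 29 * 999460 = 28984340.)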

Thanks in advance, looking forward to hearing your thoughts.


Update:

Following optimization suggestions by @PeterCordes, I implemented a more efficient version of the intrinsics variant:

double intrinsics_variant(const matrix_v3& mat, const double& k1, const double& k2)
{
    // inner loop constants
    static constexpr auto unroll_count = 3;
    static constexpr auto register_width = 4;
    static constexpr auto increment = register_width * unroll_count;
    
    // loop blocking details
    constexpr auto lld_cache_size = VVI_LLD_CACHE_SIZE_MB * 1024 * 1024 / (sizeof(double) * 4); // 4 comes from actual application
    const auto block_size = lld_cache_size / (mat.n1() - 1) / increment * increment; // must be divisible by increment!
    const auto block_size_in_v3 = block_size / 3;
    const auto rstream_size_div = mat.n2() * 3 / block_size * block_size;
    
    // packed constants
    const auto k2_b_2 = k2 * 0.5;
    const auto k1_p = _mm256_broadcast_sd(&k1);
    const auto k2_b_2_p = _mm256_broadcast_sd(&k2_b_2);
    
    // packed accumulators
    auto accumulator_p_0 = __m256d{};
    auto accumulator_p_1 = __m256d{};
    auto accumulator_p_2 = __m256d{};
    
    // let's go!
    auto particle_index = std::size_t{};
    for (auto start_index = std::size_t{}; start_index < rstream_size_div; start_index += block_size,
        particle_index += block_size_in_v3)
    {
        for (auto i1 = std::size_t{}; i1 < mat.n1() - 1; ++i1)
        {
            // real valued streams
            const auto stream1 = mat.stream(i1);
            const auto stream2 = mat.stream(i1 + 1);
            
            // loop
            const auto end_index = start_index + block_size;
            for (auto index = start_index; index < end_index; index += increment)
            {
                // load x(i) and x(i)_r
                const auto x_0 = _mm256_loadu_pd(stream1 + index);
                const auto x_r_0 = _mm256_loadu_pd(stream2 + index);
                
                const auto x_1 = _mm256_loadu_pd(stream1 + index + 4);
                const auto x_r_1 = _mm256_loadu_pd(stream2 + index + 4);
                
                const auto x_2 = _mm256_loadu_pd(stream1 + index + 8);
                const auto x_r_2 = _mm256_loadu_pd(stream2 + index + 8);
                
                // subtraction and FMA (results in lambda)
                const auto lambda_0 = _mm256_fnmadd_pd(x_0, k1_p, x_r_0);
                const auto lambda_1 = _mm256_fnmadd_pd(x_1, k1_p, x_r_1);
                const auto lambda_2 = _mm256_fnmadd_pd(x_2, k1_p, x_r_2);
                
                // dot product
                const auto lambda_0_sq = _mm256_mul_pd(lambda_0, lambda_0);
                const auto lambda_1_sq = _mm256_mul_pd(lambda_1, lambda_1);
                const auto lambda_2_sq = _mm256_mul_pd(lambda_2, lambda_2);
                
                // accumulate in packed register
                accumulator_p_0 = _mm256_fmadd_pd(lambda_0_sq, k2_b_2_p, accumulator_p_0);
                accumulator_p_1 = _mm256_fmadd_pd(lambda_1_sq, k2_b_2_p, accumulator_p_1);
                accumulator_p_2 = _mm256_fmadd_pd(lambda_2_sq, k2_b_2_p, accumulator_p_2);
            }
        }
    }
    
    // sum up the unrolled accumulators
    const auto accumulator_p_intermediate = _mm256_add_pd(accumulator_p_0, accumulator_p_1);
    const auto accumulator_p = _mm256_add_pd(accumulator_p_intermediate, accumulator_p_2);
    
    // horizontal sum of packed dot product
    const auto high = _mm256_extractf128_pd(accumulator_p, 1);
    const auto low = _mm256_castpd256_pd128(accumulator_p);
    const auto sum_128 = _mm_add_pd(low, high);
    const auto sum_128_1 = _mm_shuffle_pd(sum_128, sum_128, 0b00000001); // 0, 1
    const auto sum_64 = _mm_add_pd(sum_128, sum_128_1);
    auto accumulator = _mm_cvtsd_f64(sum_64);
    
    // outer remainder loop (no optimization necessary here)
    for (auto i1 = std::size_t{}; i1 < mat.n1() - 1; ++i1)
    {
        for (auto i2 = particle_index; i2 < mat.n2(); ++i2)
        {
            auto lambda = mat.position(i1 + 1, i2) - mat.position(i1, i2) * k1;
            accumulator += k2_b_2 * (lambda(0) * lambda(0) + lambda(1) * lambda(1) + lambda(2) * lambda(2));
        }
    }
    
    return accumulator;
}

The above consistently shows an even greater deviation of ~0.001716434955597.
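
To judge which of the two results is actually closer to the true value, one of the suggestions from the comments is to compute a higher-precision reference, e.g. via Kahan (compensated) summation, and compare both variants against that instead of against each other. A minimal sketch of such a reference based on the scalar loop (the function name kahan_variant is purely illustrative):

double kahan_variant(const matrix_v3& mat, const double& k1, const double& k2)
{
    const auto k2_b_2 = k2 * 0.5;
    
    // running sum and compensation term (must not be compiled with -ffast-math,
    // which would optimize the compensation away)
    auto sum = double{};
    auto compensation = double{};
    
    for (auto i1 = std::size_t{}; i1 < mat.n1() - 1; ++i1)
    {
        for (auto i2 = std::size_t{}; i2 < mat.n2(); ++i2)
        {
            const auto lambda = mat.position(i1 + 1, i2) - mat.position(i1, i2) * k1;
            const auto term = k2_b_2 * (lambda(0) * lambda(0) + lambda(1) * lambda(1) + lambda(2) * lambda(2));
            
            // Kahan update: recover the low-order bits lost in the previous addition
            const auto y = term - compensation;
            const auto t = sum + y;
            compensation = (t - sum) - y;
            sum = t;
        }
    }
    
    return sum;
}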

Comments:
  • It looks like the initial subtraction is the same for both, and you're doing a huge amount of shuffling to do the adds in the same order as scalar source order? Except reducing to scalar for groups of elements before adding to `accumulator`, so it's slightly closer to pairwise summation. As in [Simd matmul program gives different numerical results](https://stackoverflow.com/q/55477701) . GCC defaults to `-mfp-contract=fast` so it should be using FMAs for `accumulator += eta(0) * eta(0) + eta(1) * eta(1) + eta(2) * eta(2)`, but maybe in a different order than your source? – Peter Cordes Feb 22 '23 at 21:19
  • I didn't really try to follow all the shuffles in my head. BTW, the 4x 2-bit element shuffles can use the `_MM_SHUFFLE(3,1,0,2)` macro from immintrin.h. BTW, it would be *much* more efficient to `_mm256_add_pd` (or FMA) into a vector accumulator and do the horizontal sum at the end. Or unroll summing into 2 or more vector accumulators. Since you're just summing across vectors and across x,y,z components within vectors, I think this is equivalent to sum-of-squares of subtraction of two flat arrays, no need to consider x,y,z components at all. (Except for rounding error.) – Peter Cordes Feb 22 '23 at 21:19
  • Anyway, no reason at all to expect your simplistic scalar version to actually be exact; if your vector version is correct with simple round numbers, it's just different rounding error, could easily be less bad than the scalar version, especially if you haven't tried extended-precision or Kahan summation for the scalar sum. You're not using `-ffast-math` so GCC won't break Kahan summation for you. – Peter Cordes Feb 22 '23 at 21:22
  • @PeterCordes the shuffling I'm doing there converts from array-of-structures into structure-of-arrays. It's a transposition to make use of vertical sums into horizontal ones. Since I'm ending up with a packed vector of dot products, I still have to do a horizontal sum to reduce it to one double. Isn't the _MM_SHUFFLE expensive? – Nitin Malapally Feb 22 '23 at 21:52
  • Your functions both have a single scalar `double` as an output, the sum of a bunch of squared values. You're reducing to scalar way earlier than necessary. (And it's weird that you take it by reference instead of just returning a double as the return value, since your functions start by zeroing it.) – Peter Cordes Feb 22 '23 at 21:55
  • `_MM_SHUFFLE` is just a macro for writing 8-bit constants that have four 2-bit fields. You can use it for the final arg to `_mm256_permute4x64_pd` exactly the same way you'd use it for the final arg to `_mm_shuffle_epi32` or `_mm_shuffle_ps`. – Peter Cordes Feb 22 '23 at 21:57
  • @PeterCordes I get it, you're saying I can do the horizontal sum outside the loop. It's a good idea. – Nitin Malapally Feb 22 '23 at 21:57
  • @PeterCordes based on what you said, how can we sensibly test our optimized function then, and how do we sensibly choose a tolerance? – Nitin Malapally Feb 22 '23 at 22:05
  • Yes, you can sink at least the horizontal sum. I suspect you can simplify much more, since you don't need to have a `__m256d` accumulator of x values, y values, and z values. Each of its elements can be from a mix of x y and z because you're eventually summing it all together anyway. If you look at what you're doing to the individual double elements of your arrays of `v3` elements, `square(a[0][0] - b[0][0]) + square(a[0][1] - b[0][1]) + ... + square(a[n-1][2] - b[n-1][2])`, that's equivalent to flat arrays of doubles summing `square(aflat[i]-bflat[i])` over all elements. – Peter Cordes Feb 22 '23 at 22:06
  • *how can we sensibly test our optimized function then* - make some test data with integer-valued whole numbers so the math will all be exact. e.g. random integers in the range +- 32767 or something, so the sum of squared differences over many elements will also be an exact integer that a double can represent. That lets you test you got the basic operations right. – Peter Cordes Feb 22 '23 at 22:11
  • For numerical precision, make a version that uses Kahan summation or 80-bit x87 `long double` to get a more accurate sum of the same data, and compare all your versions against that, including the simple scalar version. If for example the `z` squared-differences tend to be a lot smaller than the `x` or `y` squared differences, then it would be numerically better to keep the `z` sums in a separate element of an accumulator instead of mixing things, because FP addition loses precision when adding numbers of different magnitudes. (Shift the inputs to align mantissas, round the result to double) – Peter Cordes Feb 22 '23 at 22:13
  • @PeterCordes Thanks for the suggestions, please refer to the updated the question. The problem is still very present. – Nitin Malapally Feb 23 '23 at 11:03
  • Did you miss my last comment, re: adding numbers of different magnitudes being a way to lose precision? And also the comment before that re: testing to make sure you're doing the right math, using round numbers. If you haven't confirmed that you're doing the right math, start with that to confirm you get exactly equal answers when there's no rounding in any of the steps. You could possibly have an off-by-one bug or double-counting bug or something. (Less likely now without any of that shuffling.) An absolute error of 0.001716 is meaningless to me with no info about your input data. – Peter Cordes Feb 23 '23 at 11:07
  • @PeterCordes Re: adding numbers of different magnitudes, we are talking about values distributed between -5 and 5. I don't see how that point is relevant. Re: testing to make sure I'm doing the right math, please read the post more carefully, I've mentioned in bold that I'm only using integer-valued floating point numbers. Or do you mean integer arrays and integer operations? – Nitin Malapally Feb 23 '23 at 11:19
  • Sorry, I misread your comment, didn't notice the word "updated". The text still didn't mention the random uniform value-range of the numbers, but since you mentioned it I see that now. That should work, and should be producing exact results if `k1` and `k2` are also integer-valued or otherwise round numbers. You can sink the multiply by `k2_b_2_p` out of the loop, too, since it's invariant over all points. Scaling every element is the same as scaling the final sum. – Peter Cordes Feb 23 '23 at 11:33
  • If your arrays are too huge, the sum might get so large that there's FP rounding, but both the correct and the actual total should be both be whole integers in that case, so the error should be integer. (Assuming the scale factors are integer.) Adding small numbers to a large accumulator is typically why scalar sequential sums are the worst (least accurate) for cases like this; I'd expect your SIMD result to be more accurate. – Peter Cordes Feb 23 '23 at 11:35
  • So anyway, it looks like you've partially implemented one of my suggestions, of using integer-valued floats as input. But `auto_vec_variant(mat, 0.1, 0.4);` calls it with two scale factors that aren't exactly representable as binary floats, defeating the purpose of using integers. (e.g. (float)0.4 is 0.4000000059604644775390625 with bits set all the way down to the bottom of the mantissa. `double` is closer to 0.4 but still inexact. See https://www.h-schmidt.net/FloatConverter/IEEE754.html). Numbers like 0.5 and 0.25 or .125 (power-of-2 denominator) are exact. – Peter Cordes Feb 23 '23 at 11:38
  • Since the result won't be exact, the deviation of `0.001716434955597` vs. the simple implementation is probably mostly rounding error in the simple implementation, same as in [Simd matmul program gives different numerical results](https://stackoverflow.com/q/55477701) – Peter Cordes Feb 23 '23 at 11:41
  • @PeterCordes Thanks for pointing out the oversight of using factors `0.1, 0.4`. I changed these to `1.0, 4.0` and the result is exact to 15 decimal places. However, this is very dissatisfying. From where I stand, for the truly real-valued case, there seems to be no way to determine which code is more accurate, except perhaps speculation. P.S: if you compose this information into a small and concise answer, we could perhaps close this thread. – Nitin Malapally Feb 23 '23 at 11:51
  • Maybe you're still forgetting about [my previous comment](https://stackoverflow.com/questions/75535438/increased-floating-point-error-when-using-simd-intrinsics-or-maybe-even-the-oth#comment133274313_75535438) where I suggested doing a more-precise sum with `long double` or with Kahan summation, which you can compare other implementations again. You'll probably find that the SIMD-accumulator version is more accurate, since each element is smaller than a single scalar sum, so the rounding error from adding one product to a big accumulator is not as bad on average. – Peter Cordes Feb 23 '23 at 12:19
  • @PeterCordes Yes, I had forgotten that. Excellent, thanks! – Nitin Malapally Feb 23 '23 at 12:33
