
I have been comparing the run times of intrinsics vector reduction, naive vector reduction, and vector reduction using OpenMP pragmas. However, I found that the results differ between these scenarios. The code is as follows (the intrinsics vector reduction is taken from Fastest way to do horizontal SSE vector sum (or other reduction)):

#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <algorithm>
#include <immintrin.h>


inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums        = _mm_add_ss(sums, shuf);
    return        _mm_cvtss_f32(sums);
}


float hsum256_ps_avx(__m256 v) {
    __m128 vlow  = _mm256_castps256_ps128(v);
    __m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
           vlow  = _mm_add_ps(vlow, vhigh);     // add the low 128
    return hsum_ps_sse3(vlow);         // and inline the sse3 version, which is optimal for AVX
    // (no wasted instructions, and all of them are the 4B minimum)
}

void reduceVector_Naive(std::vector<float> values){
    float result = 0;
    for(int i=0; i<int(1e8); i++){
        result  += values.at(i);
    }
    printf("Reduction Naive = %f \n", result);
}


void reduceVector_openmp(std::vector<float> values){
    float result = 0;
    #pragma omp simd reduction(+: result)
    for(int i=0; i<int(1e8); i++){
        result  += values.at(i);
    }

    printf("Reduction OpenMP = %f \n", result);
}

void reduceVector_intrinsics(std::vector<float> values){
    float result = 0;
    float* data_ptr = values.data();

    for(int i=0; i<1e8; i+=8){
        result  += hsum256_ps_avx(_mm256_loadu_ps(data_ptr + i));
    }

    printf("Reduction Intrinsics = %f \n", result);
}


int main(){

    std::vector<float> values;

    for(int i=0; i<1e8; i++){
        values.push_back(1);
    }


    reduceVector_Naive(values);
    reduceVector_openmp(values);
    reduceVector_intrinsics(values);

// The result should be 1e8 in each case
}

However, my output is as follows -

Reduction Naive = 16777216.000000 
Reduction OpenMP = 16777216.000000 
Reduction Intrinsics = 100000000.000000 

As can be seen, only the intrinsics version calculated it correctly; the other two ran into precision issues. I am fully aware of the precision problems one can run into with float due to rounding, so my question is: why did the intrinsics version get the answer right even though it, too, is floating-point arithmetic?

I am compiling it as: g++ -mavx2 -march=native -O3 -fopenmp main.cpp
Tried with GCC 7.5.0 as well as 10.3.0.

TIA

Comments:

• SIMD reduction is effectively using multiple accumulators (especially if you use multiple vectors, which you should to hide FP add latency), so it's [a step in the direction of pairwise summation](https://stackoverflow.com/questions/55477701/simd-matmul-program-gives-different-numerical-results). Or at least it would be if you weren't defeating the purpose by reducing to scalar inside the loop! As I mentioned in the hsum Q&A you linked (in the bullet point about array dot products), do horizontal stuff once at the end; `_mm256_add_ps` in the loop. – Peter Cordes Dec 30 '21 at 22:34

• If OpenMP had auto-vectorized properly, you'd see `100000000.0` from it. Apparently `.at(i)` with bounds checking, instead of `[i]`, is defeating OpenMP auto-vectorization. (And of Naive if you use `-ffast-math` - https://godbolt.org/z/EeaWYPsYo). Holy crap, though, OpenMP makes insane code using `vgatherqps` instead of contiguous loads. – Peter Cordes Dec 30 '21 at 22:42

• https://godbolt.org/z/Y8cG6dozo shows sane OpenMP vectorization, indexing a `const float *` inside the loop instead of calling `std::vector`'s overloaded `operator[]` and failing to see that the resulting access locations were contiguous. Unfortunately even with OpenMP, GCC still didn't use multiple accumulators to hide FP latency, limiting it to 32 bytes every 4 cycles. clang does better, although unrolling more than seems justified if it's only going to use 4 vectors. https://godbolt.org/z/ssoqas575 – Peter Cordes Dec 30 '21 at 22:48
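To make the comments concrete, here is a minimal sketch (not from the original post) of the two fixes they describe: keep the running sum in a `__m256` accumulator and do the horizontal reduction only once after the loop, and index a raw `const float*` instead of calling `.at(i)` so OpenMP can vectorize. The function names are made up for illustration, `hsum256_ps_avx` is the helper from the question, and a tail loop for sizes that are not a multiple of 8 is omitted.

#include <cstddef>
#include <immintrin.h>
#include <vector>

// Sketch: one vector accumulator, horizontal sum done once after the loop.
// Each of the 8 lanes then only accumulates n/8 values, staying well below 2^24 here.
float reduce_intrinsics_accum(const std::vector<float>& values){
    const float* data = values.data();
    __m256 acc = _mm256_setzero_ps();
    for(std::size_t i = 0; i + 8 <= values.size(); i += 8){
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    }
    return hsum256_ps_avx(acc);   // helper from the question
}

// Sketch: the same OpenMP reduction, but over a raw pointer so the accesses are
// visibly contiguous and bounds checking doesn't block vectorization.
float reduce_openmp_ptr(const std::vector<float>& values){
    const float* data = values.data();
    float result = 0;
    #pragma omp simd reduction(+: result)
    for(long i = 0; i < (long)values.size(); i++){
        result += data[i];
    }
    return result;
}

With a single vector accumulator each lane sums only 1.25e7 ones, well below 2^24, so both sketches should print 100000000; as the comments note, several interleaved accumulators would additionally hide the FP-add latency.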

1 Answer


The naïve loop adds by 1.0, and it stops adding at 16777216.000000, since there are not enough significant digits in a binary32 float: 16777217 is not representable.

See this Q&A: Why does a float variable stop incrementing at 16777216 in C#?

When you add the computed horizontal sum, it adds by 8.0, so the value at which it stops adding is about 16777216 * 8 = 134217728; you just don't reach it in your experiment.
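A small standalone check of these cutoffs (my own illustration, not part of the original answer):

#include <cstdio>

int main(){
    float f = 16777216.0f;            // 2^24: spacing between adjacent floats is 2.0 here
    printf("%f\n", f + 1.0f);         // 16777216.000000: the 1.0 is rounded away
    printf("%f\n", f + 8.0f);         // 16777224.000000: a step of 8.0 still fits

    float g = 134217728.0f;           // about 2^27: spacing between adjacent floats is 16.0 here
    printf("%f\n", g + 8.0f);         // 134217728.000000: now even 8.0 is rounded away
    return 0;
}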

– Alex Guteniev