Assignment with Intel intrinsics - horizontal add

Question

I want to sum up all elements of a big vector ary. My idea was to do it with a horizontal sum.

const int simd_width = 16/sizeof(float); 
float helper[simd_width];

//take the first 4 elements
const __m128 a4 = _mm_load_ps(ary);

for(int i=0; i<N-simd_width; i+=simd_width){
     const __m128 b4 = _mm_load_ps(ary+i+simd_width);
     //save temporary result in helper array
     _mm_store_ps(helper, _mm_hadd_ps(a4,b4)); //C
     const __m128 a4 = _mm_load_ps(helper);

}

I am looking for a way to assign the resulting vector directly to the quad-float a4, something like _mm_store_ps(a4, _mm_hadd_ps(a4,b4)). Is there such an Intel intrinsic? (This is my first time working with SSE, so maybe the whole code snippet is wrong.)

Sorry for my bad English. I wonder if there is an Intel intrinsic with which I can directly assign the new value `C = _mm_hadd_ps(a4,b4)` to a4. Or can I overwrite a4 directly, like `a4 = _mm_hadd_ps(a4,b4)`? — Suslik, Oct 22 '18 at 19:22
Yes exactly. The horizontal add is slow though, so it would be better to do this with fewer of them and more `_mm_add_ps` — harold, Oct 22 '18 at 19:24
Look at how clang or ICC auto-vectorize a loop, using vertical add with two or more accumulators, then a horizontal sum at the end. [Unroll loop and do independent sum with vectorization](https://stackoverflow.com/q/33038542). That's much faster than doing 2 shuffles per vector with `haddps`. See [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/a/35270026) — Peter Cordes, Oct 22 '18 at 19:41

score 2 · Accepted Answer · answered Nov 11 '18 at 14:22

As Peter suggested, do not use horizontal sums. Use vertical sums.

For example, in pseudo-code, with SIMD width = 2:

SIMD sum = {0,0}; // we use 2 accumulators
for (int i = 0; i + 1 < n; i += 2)
    sum = simd_add(sum, simd_load(x+i));
float s = horizontal_add(sum);
if (n & 1)  // n was not a multiple of 2?
    s += x[n-1]; // deal with last element
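
The same pattern can be written with real SSE intrinsics. The following is a minimal sketch in C, not a definitive implementation. The function names `hsum_ps` and `sum_floats` are made up for illustration. It uses unaligned loads (`_mm_loadu_ps`) so that `x` need not be 16-byte aligned, and it keeps two independent vector accumulators, as suggested in the comments. Note that `__m128` is an ordinary value type, so a non-const variable can be overwritten directly with `acc = _mm_add_ps(acc, ...)`; no store to a helper array is needed.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps, _mm_loadu_ps, ... */

/* Hypothetical helper: horizontal sum of the four lanes of v.
   It is called once, after the loop, so its cost does not matter much. */
static float hsum_ps(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);   /* lanes: v2, v3, v2, v3 */
    __m128 sum2 = _mm_add_ps(v, hi);     /* lane 0 = v0+v2, lane 1 = v1+v3 */
    /* move lane 1 of sum2 into lane 0 */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(0, 0, 0, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));
}

/* Hypothetical function: sum of x[0..n-1] using vertical adds. */
float sum_floats(const float *x, size_t n)
{
    __m128 acc0 = _mm_setzero_ps();      /* two independent accumulators */
    __m128 acc1 = _mm_setzero_ps();
    size_t i = 0;

    /* main loop: 8 floats per iteration, vertical adds only */
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(x + i));      /* plain assignment */
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(x + i + 4));
    }

    /* combine the accumulators, then do one horizontal sum */
    float s = hsum_ps(_mm_add_ps(acc0, acc1));

    /* scalar tail for the remaining 0..7 elements */
    for (; i < n; ++i)
        s += x[i];
    return s;
}
```

Calling `sum_floats(ary, N)` would replace the loop from the question. Because the additions happen in a different order than in a plain scalar loop, the result can differ slightly in the last bits for general floating-point data.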
