I want to sum all elements of a big vector ary. My idea was to do it with a horizontal sum.

const int simd_width = 16/sizeof(float); 
float helper[simd_width];

//take the first 4 elements
const __m128 a4 = _mm_load_ps(ary);

for(int i=0; i<N-simd_width; i+=simd_width){
     const __m128 b4 = _mm_load_ps(ary+i+simd_width);
     //save temporary result in helper array
     _mm_store_ps(helper, _mm_hadd_ps(a4,b4)); //C
     const __m128 a4 = _mm_load_ps(helper);

}

I looked for a way to assign the resulting vector directly to the `__m128` variable a4, something like _mm_store_ps(a4, _mm_hadd_ps(a4,b4)). Is there such an Intel intrinsic? (This is my first time working with SSE, so the whole code snippet may be wrong.)

Suslik
  • Sorry for my bad English. I wonder if there is an Intel intrinsic with which I can directly assign the new value `C = _mm_hadd_ps(a4,b4)` to a4. Or can I overwrite a4 directly, like `a4 = _mm_hadd_ps(a4,b4)`? – Suslik Oct 22 '18 at 19:22
  • No intrinsic, you can just assign it with `=` – harold Oct 22 '18 at 19:22
  • So I can reuse a4 directly with `a4 = _mm_hadd_ps(a4,b4);`? – Suslik Oct 22 '18 at 19:23
  • Yes exactly. The horizontal add is slow though, so it would be better to do this with fewer of them and more `_mm_add_ps` – harold Oct 22 '18 at 19:24
  • Thank you a lot for the information. I will try it. – Suslik Oct 22 '18 at 19:24
  • Look at how clang or ICC auto-vectorize a loop, using vertical add with two or more accumulators, then a horizontal sum at the end. [Unroll loop and do independent sum with vectorization](https://stackoverflow.com/q/33038542). That's much faster than doing 2 shuffles per vector with `haddps`. See [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/a/35270026) – Peter Cordes Oct 22 '18 at 19:41
  • Perfect. Thank you. This is a lot better. – Suslik Oct 22 '18 at 20:01
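
The point made in the comments, that a `__m128` is an ordinary value you can reassign with `=`, could be sketched roughly as follows. The function name `accumulate4` and its parameters are made up for illustration. The sketch uses `_mm_add_ps` (as harold recommends) and unaligned loads so that it does not assume anything about the alignment of `ary`; the same kind of assignment works with `_mm_hadd_ps`, provided SSE3 is enabled.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_setzero_ps, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: adds n_vecs consecutive groups of 4 floats lane by lane
   and writes the 4 per-lane partial sums to out. */
void accumulate4(const float *ary, size_t n_vecs, float out[4])
{
    __m128 a4 = _mm_setzero_ps();  /* not const, so it can be reassigned */

    for (size_t i = 0; i < n_vecs; ++i) {
        const __m128 b4 = _mm_loadu_ps(ary + 4 * i);
        a4 = _mm_add_ps(a4, b4);   /* plain assignment, no helper array needed */
    }

    _mm_storeu_ps(out, a4);        /* store only once, after the loop */
}
```

Note that the original snippet declares `const __m128 a4` a second time inside the loop, which creates a new variable that shadows the outer one instead of updating it; dropping `const` and assigning, as above, avoids that.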

1 Answer

As Peter suggested, do not use horizontal sums inside the loop. Accumulate with vertical sums and do a single horizontal sum at the end.

For example, in pseudo-code, with a SIMD width of 2:

SIMD sum = {0,0}; // one SIMD register holding 2 running partial sums
for (int i = 0; i + 1 < n; i += 2)
    sum = simd_add(sum, simd_load(x+i)); // vertical add, lane by lane
float s = horizontal_add(sum); // one horizontal add, after the loop
if (n & 1)  // n was not a multiple of 2?
    s += x[n-1]; // deal with last element
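
For 4-wide SSE floats, the pseudo-code above might translate to something like the following sketch. The name `sum_floats` is made up for illustration, unaligned loads are used so that no alignment of `x` is assumed, and the final horizontal sum is done with SSE1 shuffles rather than `_mm_hadd_ps`, which is one of several possible ways to do it.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_movehl_ps, _mm_shuffle_ps, ... */

/* Hypothetical helper: sums n floats using one 4-lane vertical accumulator. */
float sum_floats(const float *x, size_t n)
{
    __m128 acc = _mm_setzero_ps();  /* 4 running partial sums */
    size_t i = 0;

    /* Vertical adds: lane k accumulates x[k], x[k+4], x[k+8], ... */
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));

    /* Single horizontal sum of the 4 lanes, done once after the loop. */
    __m128 hi    = _mm_movehl_ps(acc, acc);   /* {a2, a3, a2, a3} */
    __m128 sum2  = _mm_add_ps(acc, hi);       /* {a0+a2, a1+a3, ...} */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    float s = _mm_cvtss_f32(_mm_add_ss(sum2, lane1));

    /* Scalar tail: the last n % 4 elements. */
    for (; i < n; ++i)
        s += x[i];

    return s;
}
```

Because the elements are added in a different order than in a plain scalar loop, the result can differ slightly in the last bits for general floating-point input.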
Fabio