
I want to sum all the 32-bit elements in a 256-bit register, but there is no intrinsic that does this directly, or if there is, I couldn't find one that does what I want. So I did something like this to compute the sum, but it generates many assembly instructions when compiled.

My method:

    _mm256_store_ps(&temp4[0], sum0_i);
    c_result[i][j] = temp4[0]+temp4[1]+temp4[2]+temp4[3]+temp4[4]+temp4[5]+temp4[6]+temp4[7];

Assembly output:

    vmovaps %ymm0, (%rsp)
    vmovss  (%rsp), %xmm0
    vaddss  4(%rsp), %xmm0, %xmm0
    vaddss  8(%rsp), %xmm0, %xmm0
    vaddss  12(%rsp), %xmm0, %xmm0
    vaddss  16(%rsp), %xmm0, %xmm0
    vaddss  20(%rsp), %xmm0, %xmm0
    vaddss  24(%rsp), %xmm0, %xmm0
    vaddss  28(%rsp), %xmm0, %xmm0
    vmovss  %xmm0, c_result(%r8,%rsi)

So the question is: how can I sum all the elements faster and more cleanly, and store the result to a 32-bit array in memory? I tried `hadd`, but it didn't improve performance, because I still have to store the results to memory, and the latency and throughput of `hadd` cost more time than the `vaddss` chain.

ADMS
  • [horizontal sum of 8 packed 32bit floats](http://stackoverflow.com/q/13879609/995714), [How to sum __m256 horizontally?](http://stackoverflow.com/q/13219146/995714), [Horizontal sum of 32-bit floats in 256-bit AVX vector](http://stackoverflow.com/q/23189488/995714) – phuclv Apr 12 '16 at 11:31
  • I read and tried them, but the speedup didn't change. – ADMS Apr 12 '16 at 11:35
  • @ADMS If the answers to the referenced questions don't speed your code up, this is probably not your bottleneck. Why are you optimizing non-time-critical code? – EOF Apr 12 '16 at 11:44
  • @ADMS: The 93-fold speedup, to me, suggests the optimizer is just removing the whole code if you comment out this part, since the rest of the code probably has no semantically visible side effects. – EOF Apr 12 '16 at 11:58
  • Good point, but I still have the problem with the latency of `hadd` and the other things mentioned in the question. – ADMS Apr 12 '16 at 12:02
  • OK, I will find the answer and share it here. – ADMS Apr 12 '16 at 12:06
  • Start with `_mm256_extractf128_ps`, `_mm_add_ps` the two halves together, then use [the existing methods for reducing a 128b vector](http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86). – Peter Cordes Apr 12 '16 at 21:11
  • Is there any penalty for switching from `AVX` to `SSE` or vice versa? – ADMS Apr 12 '16 at 21:43

1 Answer

You might start with the code an optimizing compiler generates for a vectorized sum reduction, with or without `accumulate()`, a Cilk Plus reducer, or an `omp simd` reduction. No doubt there is a step that adds the 128-bit sub-registers, one with `hadd`, and so on.
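
As a rough illustration of that kind of reduction, here is a minimal sketch in C that adds the two 128-bit halves first and then reduces the remaining four floats with shuffles instead of `hadd`. The helper names `hsum256_ps` and `hsum8` are hypothetical, and the `target("avx")` attribute is a GCC/Clang assumption so the sketch builds without `-mavx`; with `-mavx` on the command line it can be dropped. This is not necessarily what any particular compiler emits.

```c
#include <immintrin.h>

/* Hypothetical helper: horizontal sum of the eight floats in a __m256. */
__attribute__((target("avx")))
static inline float hsum256_ps(__m256 v)
{
    __m128 lo   = _mm256_castps256_ps128(v);    /* lower 128 bits, no instruction needed */
    __m128 hi   = _mm256_extractf128_ps(v, 1);  /* upper 128 bits */
    __m128 sum4 = _mm_add_ps(lo, hi);           /* four partial sums */
    __m128 high = _mm_movehl_ps(sum4, sum4);    /* move lanes 2,3 down to lanes 0,1 */
    __m128 sum2 = _mm_add_ps(sum4, high);       /* lanes 0,1 hold two partial sums */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, 0x55); /* broadcast lane 1 */
    __m128 sum1 = _mm_add_ss(sum2, odd);        /* lane 0 holds the total */
    return _mm_cvtss_f32(sum1);
}

/* Hypothetical wrapper: sum eight floats from (possibly unaligned) memory. */
__attribute__((target("avx")))
float hsum8(const float *p)
{
    return hsum256_ps(_mm256_loadu_ps(p));
}
```

In the question's loop this would replace the store to `temp4` and the scalar add chain with something like `c_result[i][j] = hsum256_ps(sum0_i);`. The additions happen in a different order than the scalar chain, so results may differ in the last bits for general floating-point input.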

tim18