I want to sum all the 32-bit float elements in a 256-bit register, but there doesn't seem to be an intrinsic for this, or if there is, I couldn't find one that does what I want. So I did something like the following to compute the sum, but this method generates many assembly instructions when compiled.
My method:
_mm256_store_ps(&temp4[0], sum0_i);
c_result[i][j]= temp4[0]+temp4[1]+temp4[2]+temp4[3]+temp4[4]+temp4[5]+temp4[6]+temp4[7];
Assembly output:
vmovaps %ymm0, (%rsp)
vmovss (%rsp), %xmm0
vaddss 4(%rsp), %xmm0, %xmm0
vaddss 8(%rsp), %xmm0, %xmm0
vaddss 12(%rsp), %xmm0, %xmm0
vaddss 16(%rsp), %xmm0, %xmm0
vaddss 20(%rsp), %xmm0, %xmm0
vaddss 24(%rsp), %xmm0, %xmm0
vaddss 28(%rsp), %xmm0, %xmm0
vmovss %xmm0, c_result(%r8,%rsi)
So the question is: how can I sum all the elements faster and more cleanly, and store the result to the 32-bit array in memory? I tried hadd, but it didn't improve the performance, because I still have the problem of saving the results to memory, and hadd's latency and throughput cost more time than vaddss.
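For reference, here is a minimal sketch of one commonly suggested alternative to both the store-and-add sequence and hadd: split the 256-bit register into two 128-bit halves, add them, then fold the remaining four lanes with shuffles. The names hsum256_ps, hsum8_from_memory, and HSUM_TARGET are hypothetical helpers made up for this illustration, and the target attribute assumes GCC or Clang. This is a sketch under those assumptions, not a measured result.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The attribute lets these functions use AVX even
   when the file is built without -mavx; with -mavx it can be dropped. */
#if defined(__GNUC__)
#define HSUM_TARGET __attribute__((target("avx")))
#else
#define HSUM_TARGET
#endif

/* Hypothetical helper: horizontal sum of the eight floats in v. */
HSUM_TARGET
static inline float hsum256_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);    /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* elements 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);           /* four partial sums */
    __m128 sh = _mm_movehl_ps(s, s);          /* lanes 2,3 moved down to 0,1 */
    s  = _mm_add_ps(s, sh);                   /* two partial sums in lanes 0,1 */
    sh = _mm_shuffle_ps(s, s, 0x55);          /* broadcast lane 1 */
    s  = _mm_add_ss(s, sh);                   /* total in lane 0 */
    return _mm_cvtss_f32(s);
}

/* Hypothetical wrapper so the helper can be exercised from plain float data. */
HSUM_TARGET
float hsum8_from_memory(const float *p)
{
    return hsum256_ps(_mm256_loadu_ps(p));    /* unaligned load of 8 floats */
}
```

With a helper like this, the store in my code would become c_result[i][j] = hsum256_ps(sum0_i); with no temporary array. Note that the additions happen in a different order than in the scalar chain, so the result can differ in the last bits for general float inputs.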