I have code that I am trying to speed up. First, I used the SSE intrinsics and saw significant gains. I am now trying to see if I can do similarly with the AVX intrinsics. The code, essentially, takes two arrays, adds or subtracts them as needed, squares the result and then sums all those squares together.
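For reference, the computation described above can be written as a plain scalar loop. This is only a minimal sketch of the subtracting case, not the original code: the function name `update_and_chi2` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical scalar baseline: subtract delta from resid in place and
   return the sum of the squared updated residuals. */
float update_and_chi2(float *resid, const float *delta, size_t n)
{
    float chi = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float r = resid[i] - delta[i];  /* new residual */
        resid[i] = r;                   /* store it back */
        chi += r * r;                   /* accumulate its square */
    }
    return chi;
}
```

A baseline like this is mainly useful for checking that the vectorized versions produce the same sum, up to floating-point reordering.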
Below is a somewhat simplified version of the code using the SSE intrinsics:
float chiList[4] __attribute__((aligned(16)));
float chi = 0.0;
__m128 res;
__m128 nres;
__m128 del;
__m128 chiInter2;
__m128 chiInter;
while(runNum<boundary)
{
    chiInter = _mm_setzero_ps();
    for(int i=0; i<maxPts; i+=4)
    {
        //load the next batch of residuals and deltas
        res = _mm_load_ps(resids+i);
        del = _mm_load_ps(residDeltas[param]+i);
        //subtract them
        nres = _mm_sub_ps(res,del);
        //store the updated residuals back to memory
        _mm_store_ps(resids+i,nres);
        //square them and accumulate into chiInter with the fused
        //multiply-add instruction
        chiInter = _mm_fmadd_ps(nres, nres, chiInter);
    }
    //sum the 4 partial sums this way because testing
    //shows it is faster than the commented-out way below
    //chiInter2 holds chiInter with its lanes reversed
    chiInter2 = _mm_shuffle_ps(chiInter,chiInter,_MM_SHUFFLE(0,1,2,3));
    //add the two, so lanes 0 and 1 together cover all four partial sums
    _mm_store_ps(chiList,_mm_add_ps(chiInter,chiInter2));
    //add again
    chi=chiList[0]+chiList[1];
    //now do stuff with the chi^2
    //alternatively, the slow way
    //_mm_store_ps(chiList,chiInter);
    //chi=chiList[0]+chiList[1]+chiList[2]+chiList[3];
}
This gets me to my first question: is there a more elegant way to do the last bit, where I take the 4 floats in chiInter and sum them into one float?
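One register-only pattern often used for this kind of reduction is sketched below. It avoids the store to `chiList` by folding the upper half of the vector onto the lower half and then adding the two remaining lanes. The helper name `hsum_ps_sse` is hypothetical, and the sketch uses only SSE1 intrinsics; whether it is actually faster than the store-based version would need to be measured.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: horizontal sum of the four floats in v. */
static inline float hsum_ps_sse(__m128 v)
{
    /* hi = [v2, v3, v2, v3]: upper half of v moved into the lower half */
    __m128 hi = _mm_movehl_ps(v, v);
    /* lanes 0 and 1 now hold v0+v2 and v1+v3 */
    __m128 sum2 = _mm_add_ps(v, hi);
    /* broadcast lane 1 so it can be added to lane 0 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    /* scalar add in lane 0, then extract that lane as a float */
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}
```

In the loop above this would be used as `chi = hsum_ps_sse(chiInter);` in place of the shuffle, store, and scalar add.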
Anyway, I am now trying to implement this using the AVX intrinsics. Most of the process is quite straightforward. Unfortunately, I am stalling on the last bit: compressing the 8 intermediate chi values into a single value.
Below is a similarly simplified piece of code for the AVX intrinsics:
float chiList[8] __attribute__((aligned(32)));
__m256 res;
__m256 del;
__m256 nres;
__m256 chiInter;
while(runNum<boundary)
{
    chiInter = _mm256_setzero_ps();
    for(int i=0; i<maxPts; i+=8)
    {
        //load the next batch of residuals and deltas
        res = _mm256_load_ps(resids+i);
        del = _mm256_load_ps(residDeltas[param]+i);
        //subtract them
        nres = _mm256_sub_ps(res,del);
        //store the updated residuals back to memory
        _mm256_store_ps(resids+i,nres);
        //square them and accumulate into chiInter with the fused
        //multiply-add instruction
        chiInter = _mm256_fmadd_ps(nres, nres, chiInter);
    }
    //store the 8 partial sums and add them up one by one
    _mm256_store_ps(chiList,chiInter);
    chi=chiList[0]+chiList[1]+chiList[2]+chiList[3]+
        chiList[4]+chiList[5]+chiList[6]+chiList[7];
}
My second question is this: is there some method, like the one I used with SSE above, that will let me do this final addition more quickly? Or, if there is a better way to do what I did with the SSE intrinsics, does it have an equivalent for the AVX intrinsics?
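For the 8-float case, one pattern commonly used is to split the 256-bit register into its two 128-bit halves, add them, and then reduce the resulting 4 floats in registers. The sketch below illustrates that idea. The names `hsum256_ps` and `hsum8_avx` are hypothetical, and the `target("avx")` attribute is a GCC/Clang-specific assumption, there only so the sketch builds without `-mavx`. It has not been benchmarked against the store-based version.

```c
#include <immintrin.h>  /* AVX intrinsics */

/* Hypothetical helper: horizontal sum of the eight floats in v. */
__attribute__((target("avx")))
static inline float hsum256_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);    /* lanes 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* lanes 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);           /* 4 partial sums */
    __m128 sh = _mm_movehl_ps(s, s);          /* [s2, s3, s2, s3] */
    s  = _mm_add_ps(s, sh);                   /* lanes 0,1: s0+s2, s1+s3 */
    sh = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(s, sh));  /* lane 0 + lane 1 */
}

/* Hypothetical wrapper: sum 8 floats from memory (no alignment required). */
__attribute__((target("avx")))
float hsum8_avx(const float *p)
{
    return hsum256_ps(_mm256_loadu_ps(p));
}
```

In the AVX loop above, the final store and eight scalar adds would become `chi = hsum256_ps(chiInter);`.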