
I have code that I am trying to speed up. First I used the SSE intrinsics and saw significant gains; now I am trying to see if I can get similar gains with the AVX intrinsics. The code essentially takes two arrays, adds or subtracts them as needed, squares the result, and then sums all those squares together.

Below is a somewhat simplified version of the code using the SSE intrinsics:

float chiList[4] __attribute__((aligned(16)));
float chi = 0.0;
__m128 res;
__m128 nres;
__m128 del;
__m128 chiInter2;
__m128 chiInter;
while(runNum<boundary)
{
    chiInter = _mm_setzero_ps();
    for(int i=0; i<maxPts; i+=4)
    {
        //load the first batch of residuals and deltas
        res = _mm_load_ps(resids+i);
        del = _mm_load_ps(residDeltas[param]+i);
        //subtract them
        nres = _mm_sub_ps(res,del);
        //store them back to memory
        _mm_store_ps(resids+i,nres);
        //square them and add them back to chi with the fused
        //multiply and add instructions
        chiInter = _mm_fmadd_ps(nres, nres, chiInter);
    }
    //sum the 4 intermediate values this way because testing
    //shows it is faster than the commented-out way below;
    //chiInter2 holds chiInter with its elements reversed
    chiInter2 = _mm_shuffle_ps(chiInter,chiInter,_MM_SHUFFLE(0,1,2,3));
    //add the two
    _mm_store_ps(chiList,_mm_add_ps(chiInter,chiInter2));
    //add again
    chi=chiList[0]+chiList[1];
    //now do stuff with the chi^2
    //alternatively, the slow way
    //_mm_store_ps(chiList,chiInter);
    //chi=chiList[0]+chiList[1]+chiList[2]+chiList[3];
}

This gets me to my first question: Is there any way to do the last bit (where I take the 4 floats in chiInter and sum them into one float) more elegantly?

Anyway, I am now trying to implement this using the AVX intrinsics. Most of the process is quite straightforward, but unfortunately I am stalled on the last step: compressing the 8 intermediate chi values into a single value.

Below is a similarly simplified piece of code using the AVX intrinsics:

float chiList[8] __attribute__((aligned(32)));
__m256 res;
__m256 del;
__m256 nres;
__m256 chiInter;
while(runNum<boundary)
{
    chiInter = _mm256_setzero_ps();
    for(int i=0; i<maxPts; i+=8)
    {
        //load the first batch of residuals and deltas
        res = _mm256_load_ps(resids+i);
        del = _mm256_load_ps(residDeltas[param]+i);
        //subtract them
        nres = _mm256_sub_ps(res,del);
        //store them back to memory
        _mm256_store_ps(resids+i,nres);
        //square them and add them back to chi with the fused
        //multiply and add instructions
        chiInter = _mm256_fmadd_ps(nres, nres, chiInter);
    }
    _mm256_store_ps(chiList,chiInter);
    chi=chiList[0]+chiList[1]+chiList[2]+chiList[3]+
        chiList[4]+chiList[5]+chiList[6]+chiList[7];
}

My second question is this: Is there some trick like the one I used with SSE above that will let me accomplish this final addition more quickly? Or, if there is a better way to do the reduction I did with the SSE intrinsics, does it have an equivalent for the AVX intrinsics?

James Matta
  • Don't worry too much about the efficiency of the final sum - assuming `maxPts` is reasonably large then the total time will be dominated by what's going on inside the for loop, and any preamble/postamble code will be irrelevant performance-wise. – Paul R Mar 28 '14 at 20:13
  • @PaulR, Unfortunately, maxPts is small, usually not more than 32. And yes, despite the tiny size I see enormous gains using the SSE vs a naive loop, i.e. 144 ns/iteration --> 14 ns/iteration. – James Matta Mar 28 '14 at 20:14
  • 1
    See the related: http://stackoverflow.com/q/9775538/1918193 . I am surprised you didn't try using haddps. Keywords to search for: horizontal add/sum. – Marc Glisse Mar 28 '14 at 20:21
  • @MarcGlisse, I didn't because I did not know what I was looking at when I passed over it. Thank you very very much for the information. If you write it up as an answer I will gladly accept it. – James Matta Mar 28 '14 at 20:23
  • If maxPts is large you might want to check the exponent of your final accumulated square of errors in `chi`: floats (single-precision floating point) only have 24 bits of significand http://en.wikipedia.org/wiki/Single-precision_floating-point_format, so as your accumulated error grows you will lose precision, and it could get to the point where you stop accumulating any further (the intermediate squared delta of residuals may be 2^24 times smaller than your running accumulated `chi`, and when the CPU normalizes for addition they go to zero). – amdn Mar 28 '14 at 20:46
  • @amdn, As I said to PaulR, maxPts is quite small, almost never more than 32. Also, I am using this code to search a multi-trillion point parameter space to find minima, the loss of precision does not concern me as large numbers will have larger exponents allowing them to be thrown out from outputting to the disk. – James Matta Mar 28 '14 at 20:51
  • @JamesMatta alright sounds good – amdn Mar 28 '14 at 20:54

1 Answer


This operation is called horizontal sum. Say you have a vector v={x0,x1,x2,x3,x4,x5,x6,x7}. First, extract the high/low parts so you have w1={x0,x1,x2,x3} and w2={x4,x5,x6,x7}. Now call _mm_hadd_ps(w1, w2), which gives: tmp1={x0+x1,x2+x3,x4+x5,x6+x7}. Again, _mm_hadd_ps(tmp1,tmp1) gives tmp2={x0+x1+x2+x3,x4+x5+x6+x7,...}. One last time, _mm_hadd_ps(tmp2,tmp2) gives tmp3={x0+x1+x2+x3+x4+x5+x6+x7,...}. You could also replace the first _mm_hadd_ps with a simple _mm_add_ps.

This is all untested and written from the doc. And no promise on the speed either...

Someone on the Intel forum shows another variant (look for HsumAvxFlt).

We can also look at what gcc suggests by compiling this code with `gcc test.c -Ofast -mavx2 -S`:

float f(float*t){
  t=(float*)__builtin_assume_aligned(t,32);
  float r=0;
  for(int i=0;i<8;i++)
    r+=t[i];
  return r;
}

The generated test.s contains:

vhaddps %ymm0, %ymm0, %ymm0
vhaddps %ymm0, %ymm0, %ymm1
vperm2f128  $1, %ymm1, %ymm1, %ymm0
vaddps  %ymm1, %ymm0, %ymm0

I am a bit surprised the last instruction isn't vaddss, but I guess it doesn't matter much.

Marc Glisse
  • Wow, this helps a lot. Thank you very much. I have been writing my improvements via searching and testing and occasionally stumbling across things. The intrinsics have saved me as I really really did not want to have to write inline assembly, but I knew nothing about them until a week ago. – James Matta Mar 28 '14 at 20:53