
Here's the sample C code that I am trying to accelerate using SSE. The two arrays are 3072 elements long and hold doubles; I may drop down to float if I don't need the precision of doubles.

double sum = 0.0;

for (int k = 0; k < 3072; k++) {
    sum += fabs(sima[k] - simb[k]);
}

double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));

Anyway, my current problem is how to do the fabs step in an SSE register for doubles or floats, so that the whole calculation stays in SSE registers, remains fast, and lets me parallelize all of the steps by partially unrolling this loop.

Here are some resources I've found: fabs() asm, or possibly this: flipping the sign - SO. The weakness of the second one is that it would need a conditional check.

Pharaun

3 Answers


I suggest using a bitwise AND with a mask. Positive and negative values have the same representation except for the most significant bit, which is 0 for positive values and 1 for negative values; see the double-precision number format. You can use one of these:

static inline __m128 abs_ps(__m128 x) {
    const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f has only the sign bit (bit 31) set
    return _mm_andnot_ps(sign_mask, x);         // ~sign_mask & x clears the sign bit
}

static inline __m128d abs_pd(__m128d x) {
    const __m128d sign_mask = _mm_set1_pd(-0.); // -0. has only the sign bit (bit 63) set
    return _mm_andnot_pd(sign_mask, x);         // ~sign_mask & x clears the sign bit
}
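
For example, here is a quick way to sanity-check abs_pd (a minimal sketch using the function above; the values are arbitrary):

#include <stdio.h>
#include <emmintrin.h>

int main(void) {
    __m128d v = _mm_set_pd(-3.5, 2.0);  // lanes: {2.0, -3.5}
    __m128d a = abs_pd(v);              // abs_pd as defined above
    double out[2];
    _mm_storeu_pd(out, a);
    printf("%f %f\n", out[0], out[1]);  // prints 2.000000 3.500000
    return 0;
}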

Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important:

double norm(const double* sima, const double* simb) {
    // assumes sima and simb are 16-byte aligned
    const __m128d* sima_pd = (const __m128d*) sima;
    const __m128d* simb_pd = (const __m128d*) simb;

    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    for (int k = 0; k < 3072/2; k += 2) {
        sum1 = _mm_add_pd(sum1, abs_pd(_mm_sub_pd(sima_pd[k],   simb_pd[k])));
        sum2 = _mm_add_pd(sum2, abs_pd(_mm_sub_pd(sima_pd[k+1], simb_pd[k+1])));
    }

    __m128d sum  = _mm_add_pd(sum1, sum2);
    __m128d hsum = _mm_hadd_pd(sum, sum); // horizontal add, requires SSE3
    return _mm_cvtsd_f64(hsum);
}

By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions out of order. Since floating-point addition is pipelined on modern CPUs, the CPU can start working on a new addition before the previous one has finished. Also, bitwise operations are executed on a separate execution unit, so the CPU can actually perform one in the same cycle as an addition/subtraction. I suggest Agner Fog's optimization manuals.
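
If addition latency is still the limiting factor, the same idea extends to more accumulators. Here is a sketch with four partial sums, reusing sima_pd, simb_pd and abs_pd from above (untested; 3072/2 is divisible by 4, so no cleanup loop is needed):

__m128d sum1 = _mm_setzero_pd(), sum2 = _mm_setzero_pd();
__m128d sum3 = _mm_setzero_pd(), sum4 = _mm_setzero_pd();
for (int k = 0; k < 3072/2; k += 4) {
    sum1 = _mm_add_pd(sum1, abs_pd(_mm_sub_pd(sima_pd[k],   simb_pd[k])));
    sum2 = _mm_add_pd(sum2, abs_pd(_mm_sub_pd(sima_pd[k+1], simb_pd[k+1])));
    sum3 = _mm_add_pd(sum3, abs_pd(_mm_sub_pd(sima_pd[k+2], simb_pd[k+2])));
    sum4 = _mm_add_pd(sum4, abs_pd(_mm_sub_pd(sima_pd[k+3], simb_pd[k+3])));
}
__m128d sum = _mm_add_pd(_mm_add_pd(sum1, sum2), _mm_add_pd(sum3, sum4));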

Finally, I don't recommend using OpenMP here. The loop is too small, and the overhead of distributing the work among multiple threads might be bigger than any potential benefit.

Norbert P.
  • When I have some time soon, I'll take a look at this, but I wanted to comment on the OpenMP aspect... The snippet that I used/presented in this question was a nested loop inside *another* vastly larger loop, and I used OpenMP on the outer loop, not the inner loop :) But yes you would be correct if the OpenMP was for this smaller inner loop. – Pharaun May 16 '11 at 18:13
  • Hello, I was able to implement this and it did speed things up by several percent, which was nice :) And I really liked Agner Fog's manuals, they were really helpful! – Pharaun Jun 23 '11 at 21:22

The maximum of -x and x should be abs(x). Here it is in code:

x = _mm_max_ps(_mm_sub_ps(_mm_setzero_ps(), x), x)
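
Wrapped up as a reusable function, this might look like the sketch below (the name abs_ps_max is just for illustration; note that edge cases such as -0.0f or NaN may not behave exactly like fabsf()):

#include <xmmintrin.h>

// abs via max(x, -x)
static inline __m128 abs_ps_max(__m128 x) {
    return _mm_max_ps(_mm_sub_ps(_mm_setzero_ps(), x), x);
}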
orlp

Probably the easiest way is as follows:

__m128d vsum = _mm_set1_pd(0.0);        // init partial sums
for (int k = 0; k < 3072; k += 2)
{
    __m128d va = _mm_load_pd(&sima[k]); // load 2 doubles from sima, simb
    __m128d vb = _mm_load_pd(&simb[k]);
    __m128d vdiff = _mm_sub_pd(va, vb);                     // calc diff = sima - simb
    __m128d vnegdiff = _mm_sub_pd(_mm_set1_pd(0.0), vdiff); // calc neg diff = 0.0 - diff
    __m128d vabsdiff = _mm_max_pd(vdiff, vnegdiff);         // calc abs diff = max(diff, -diff)
    vsum = _mm_add_pd(vsum, vabsdiff);  // accumulate two partial sums
}

Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However if you can drop down to single precision then you may well get a 2x throughput improvement.
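If you do switch to single precision, the loop translates directly; here is a sketch (assuming sima and simb become 16-byte-aligned float arrays):

__m128 vsum = _mm_set1_ps(0.0f);        // init partial sums
for (int k = 0; k < 3072; k += 4)       // 4 floats per SSE register
{
    __m128 va = _mm_load_ps(&sima[k]);
    __m128 vb = _mm_load_ps(&simb[k]);
    __m128 vdiff = _mm_sub_ps(va, vb);
    __m128 vnegdiff = _mm_sub_ps(_mm_setzero_ps(), vdiff);
    __m128 vabsdiff = _mm_max_ps(vdiff, vnegdiff);
    vsum = _mm_add_ps(vsum, vabsdiff);  // accumulate four partial sums
}
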

Note also that you will need to combine the two partial sums in vsum into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.
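
For example, one simple way to do that final horizontal sum for the double version (SSE2 only; the single-precision sketch above would need one extra unpack/add step for its four lanes):

__m128d vtemp = _mm_unpackhi_pd(vsum, vsum); // move the high lane down
vsum = _mm_add_pd(vsum, vtemp);              // low lane now holds the full sum
double sum = _mm_cvtsd_f64(vsum);            // extract the scalar result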

Paul R
    The second part where I took care of the partial sums took me a bit to pull together, but overall the whole solution worked. Thank you! – Pharaun Apr 02 '11 at 00:35
  • @Pharaun: thanks for the feedback - did you benchmark the SSE code versus the original scalar code, and if so how much performance improvement did you see ? – Paul R Apr 02 '11 at 09:55
  • @Paul with a single-thread implementation I knocked off 20% but with multi-threaded via openMP it was just a percentage, so I still need to work on tweaking it and looking into moving over to floats for more performance. – Pharaun Apr 03 '11 at 00:41
  • @Paul, just moved it over to a floating-point implementation and it's knocking ~15 to 30% off execution time compared to the plain C, even with OpenMP enabled, so it seems the better use of memory bandwidth plus quad-pumping the floats led to a big win. – Pharaun Apr 03 '11 at 07:00
  • @Pharaun: great - glad it helped. If you can upgrade to a recent Sandy Bridge CPU and use AVX then you should see a much bigger improvement. (AVX does 8 x floats compared to 4 x floats with SSE). – Paul R Apr 03 '11 at 07:22
  • @Paul Some day, some day! But yeah upgrading to Sandy Bridge isn't something I can do at the moment tho. – Pharaun Apr 03 '11 at 08:21
  • And *another* anonymous down-vote, 4 years later - would it be so hard to leave a comment explaining what the problem is ? I'm happy to improve or even delete the answer as appropriate, but I really don't see what's wrong with it ? – Paul R Jun 14 '15 at 06:39
  • This solution is vastly more complex than it needs to be; in particular, no arithmetic is needed to accomplish this. Flipping the leading bit of the binary IEEE 754 representation as shown by Norbert P. above is the way to go. – Wenzel Jakob Aug 03 '16 at 20:38
  • @WenzelJakob: "vastly more complex" seems to be "vastly" over-stating it - you still need to subtract the two input values, so the only real difference is that I'm doing a subtract and a max in my code, versus a bit flip in Norbert's solution. There is also the question of whether correct behaviour is needed for edge cases such as `NaN`s - I'm not sure that the bit flip solution gives strictly correct results in such cases, but this may not be important. Anyway, point taken - thank you for taking the time to explain your down-vote. – Paul R Aug 04 '16 at 08:08