I have developed a Mandelbrot generator for Windows, which I have just converted to use SSE intrinsics. To detect the end of the iterations in scalar arithmetic, I do a greater-than compare and break out. With SSE I can compare the whole vector using _mm_cmpgt_pd/_mm_cmpgt_ps; however, this writes a new 128-bit vector in which every lane that passes the compare is set to all 1s, and the case I care about is the whole vector being all 1s.

My question is: is there a more efficient way of detecting all 1s than checking the two packed 64-bit integers? Or, if it is more efficient to detect all 0s, I could compare for less-than instead. Here is what I currently have:

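__m128d CompareResult = _mm_cmpgt_pd( Magnitude, EarlyOut );
const __m128i Tmp = *reinterpret_cast< __m128i* >( &CompareResult );
// Test each 64-bit lane separately: "a == b == -1" would compare a bool against -1.
if ( Tmp.m128i_u64[ 0 ] == ~0ull && Tmp.m128i_u64[ 1 ] == ~0ull )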
{
    break;
}

The reason I want to find a better way is that I don't like the cast, and also that, according to VTune, over 30% of my iteration time is spent in this last line. I know a lot of that will be the branch itself, but I assume I can reduce it with a better way of detecting all 0s or all 1s.

Thanks

jww
allanmb

1 Answer

Assuming you're testing the result of a compare, you can just extract the most significant bit of each byte as a 16-bit int and test that, e.g.

int mask = _mm_movemask_epi8((__m128i)CompareResult);
if (mask == 0xffff)
{
    // compare results are all "true"
}
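Because the compare result in the question is a `__m128d`, one possible way to avoid the cast altogether is `_mm_movemask_pd`, which gathers the sign bit of each double into a 2-bit mask. The following is only a minimal sketch assuming SSE2; the helper name `all_greater` and the parameter names are hypothetical and simply mirror the question's variables.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: true when every lane of magnitude exceeds early_out.
inline bool all_greater(__m128d magnitude, __m128d early_out)
{
    // Each 64-bit lane becomes all 1s where magnitude > early_out, else all 0s.
    const __m128d cmp = _mm_cmpgt_pd(magnitude, early_out);
    // One mask bit per double (its sign bit): 0x3 means both lanes are "true".
    return _mm_movemask_pd(cmp) == 0x3;
}
```

For single precision the same idea should carry over with `_mm_cmpgt_ps` and `_mm_movemask_ps`, where the all-true mask is `0xF`. Whether this is faster than the original check is something to measure on the target CPU.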

Note that this is one example of a more general technique for SIMD predicates in SSE, i.e.

mask == 0xffff // all "true"
mask == 0x0000 // all "false"
mask != 0xffff // any "false"
mask != 0x0000 // any "true"
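
The four predicates above could be wrapped as small helpers. This is a sketch assuming SSE2 and a vector that holds a compare result (each lane all 1s or all 0s); the function names are hypothetical.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2 intrinsics

// _mm_movemask_epi8 packs the top bit of each of the 16 bytes into bits 0..15.
inline bool all_true(__m128i cmp)  { return _mm_movemask_epi8(cmp) == 0xffff; }
inline bool all_false(__m128i cmp) { return _mm_movemask_epi8(cmp) == 0x0000; }
inline bool any_false(__m128i cmp) { return _mm_movemask_epi8(cmp) != 0xffff; }
inline bool any_true(__m128i cmp)  { return _mm_movemask_epi8(cmp) != 0x0000; }
```

A floating-point compare result can be passed in through `_mm_castpd_si128` or `_mm_castps_si128`, which reinterpret the register without generating a conversion.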
Paul R
  • I think this is the most efficient solution also. Might need 0xFFFF in place of 0xFF when 128 bit data is used. –  Apr 15 '13 at 16:48
  • Thanks for the reply. I popped this in and it was very slightly slower than my original version. Also, I still needed the ugly cast as my SSE variable type is __m128d although I made it into a const reference which saved some time too. – allanmb Apr 16 '13 at 14:35
  • Note that the performance picture may vary on different CPUs and with different compilers, i.e. you might get a win in some cases but not in others, so if this is code for general distribution then you might want to benchmark with other CPUs if you can. – Paul R Apr 16 '13 at 15:27
  • If you've got SSE4 (`PTEST`) then see the question linked to above. – FrankH. May 02 '13 at 16:01
  • @FrankH: `PTEST` is good for checking for all zeroes, but I'm not sure it helps with the all ones case ? – Paul R May 02 '13 at 16:04
  • I'd think that _inverting your comparison_ would do that ? – FrankH. May 02 '13 at 16:14
  • @FrankH: unfortunately not - "not all bits equal to zero" is not the same as "all bits equal to one". – Paul R May 02 '13 at 20:36
  • @PaulR: That's not what I meant. Instead of using `compareResult = (magnitude > earlyOut)` you can use `compareResult = (magnitude <= earlyOut)`. If the former is all ones, the latter is all zeroes. – FrankH. May 03 '13 at 08:09
  • @FrankH: I see what you mean - I was thinking more of the more general problem of testing for any/all 1s/0s, but you're right in the case of a compare. – Paul R May 03 '13 at 08:41
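
The inverted-compare idea from the comments could look roughly like the sketch below. It assumes SSE4.1 is available and that no lane holds a NaN (with a NaN both `>` and `<=` are false, so the two tests stop being equivalent). The helper names and the `SSE41_TARGET` macro are hypothetical; the macro only exists so GCC/Clang can compile the functions without a global `-msse4.1` flag. SSE4.1 also provides `_mm_test_all_ones`, which appears to cover the all-ones case directly.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2 intrinsics
#include <smmintrin.h> // SSE4.1 intrinsics (PTEST)

#if defined(__GNUC__) || defined(__clang__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

// Hypothetical helper: true when every lane of magnitude exceeds early_out.
// Uses the inverted compare, so "all exceeded" shows up as an all-zero vector.
SSE41_TARGET inline bool all_exceeded(__m128d magnitude, __m128d early_out)
{
    const __m128i within = _mm_castpd_si128(_mm_cmple_pd(magnitude, early_out));
    // _mm_testz_si128(a, b) returns 1 when (a & b) has no bits set.
    return _mm_testz_si128(within, within) != 0;
}

// Hypothetical helper: direct all-ones test on an integer vector.
SSE41_TARGET inline bool all_ones(__m128i v)
{
    return _mm_test_all_ones(v) != 0;
}
```

As noted in the comments, the performance picture may differ between CPUs and compilers, so this is worth benchmarking against the movemask version rather than assuming a win.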