I have developed a Mandelbrot generator for Windows which I have just converted to use SSE intrinsics. To detect the end of the iterations, in normal arithmetic I do a greater-than compare and break out of the loop. In SSE I can compare the whole vector at once using _mm_cmpgt_pd/_mm_cmpgt_ps, but this writes a new 128-bit vector in which each lane is all 1s where the comparison is true.
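For illustration, here is a minimal sketch of what the compare produces (Magnitude and EarlyOut are stand-ins for my real per-pixel values):

#include <emmintrin.h> // SSE2

// Each lane of an _mm_cmpgt_pd result is all 1s (0xFFFFFFFFFFFFFFFF)
// where the compare is true and all 0s where it is false.
const __m128d Magnitude = _mm_set_pd( 5.0, 1.0 ); // lane 1 escaped, lane 0 has not
const __m128d EarlyOut  = _mm_set1_pd( 4.0 );     // squared escape radius
const __m128d Mask      = _mm_cmpgt_pd( Magnitude, EarlyOut );
// Viewed as 64-bit integers: lane 0 == 0, lane 1 == 0xFFFFFFFFFFFFFFFF.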
My question is: is there a more efficient way of detecting all 1s than checking the two packed 64-bit integers individually? Or, if detecting all 0s is more efficient, I could compare for less-than instead. Here is what I currently have:
// Compare both lanes at once; a lane is all 1s where Magnitude > EarlyOut.
const __m128d CompareResult = _mm_cmpgt_pd( Magnitude, EarlyOut );
const __m128i Tmp = *reinterpret_cast< const __m128i* >( &CompareResult );
// Break out only when both lanes have escaped, i.e. both are all 1s.
if ( Tmp.m128i_u64[ 0 ] == ~0ull && Tmp.m128i_u64[ 1 ] == ~0ull )
{
break;
}
The reason I want to find a better way is partly that I don't like the cast, but also because, according to VTune, over 30% of my iteration time is spent on that last line. I know a lot of that will be the branch itself, but I assume I can reduce it with a better way of detecting all 0s or all 1s.
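For comparison, here is the same check written with _mm_movemask_pd, which packs the sign bit of each lane into an ordinary integer and so avoids the cast entirely; I don't know whether it is actually cheaper here, which is part of what I'm asking:

// A compare-result lane is either all 0s or all 1s, so its sign bit tells
// the whole story. _mm_movemask_pd puts the two sign bits into bits 0 and 1
// of an int; the value is 0x3 exactly when both lanes compared true.
if ( _mm_movemask_pd( CompareResult ) == 0x3 )
{
break;
}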
Thanks