It's too bad gcc / clang don't autovectorize this, because it's pretty easy (Godbolt - clang 3.7):
// clang doesn't let this be a constexpr to mark it as a pure function :/
const bool bbox(const BBox& b, const Vec& v)
{
    // if you can guarantee alignment, then these can be load_ps
    // saving a separate load instruction for SSE.  (AVX can fold unaligned loads)
    // maybe make Vec a union with __m128
    __m128 blo = _mm_loadu_ps(&b.min.x);
    __m128 bhi = _mm_loadu_ps(&b.max.x);
    __m128 vv  = _mm_loadu_ps(&v.x);
    blo = _mm_cmple_ps(blo, vv);
    bhi = _mm_cmple_ps(vv, bhi);
    __m128 anded = _mm_and_ps(blo, bhi);
    int mask = _mm_movemask_ps(anded);
    // mask away the result from the padding element,
    // check that all the bits are set
    return (mask & 0b0111) == 0b0111;
}
This compiles to
movups xmm0, xmmword ptr [rdi]
movups xmm1, xmmword ptr [rdi + 16]
movups xmm2, xmmword ptr [rsi]
cmpleps xmm0, xmm2
cmpleps xmm2, xmm1
andps xmm2, xmm0
movmskps eax, xmm2
and eax, 7
cmp eax, 7
sete al
ret
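The alignment remark in the code comments can be made concrete. This is only a sketch: the post never shows the struct definitions, so the `Vec` layout (three floats plus one padding float, `alignas(16)`), the `BBox` layout, and the name `bbox_aligned` are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Assumed layout (not from the original post): x, y, z plus one padding
// float, aligned to 16 bytes so a whole Vec is a single aligned load.
struct alignas(16) Vec {
    float x, y, z, pad;
};

struct BBox {
    Vec min, max;
};

static_assert(sizeof(Vec) == 16, "Vec must fill exactly one XMM register");

inline bool bbox_aligned(const BBox& b, const Vec& v)
{
    // Debug-build check that the alignment promise really holds.
    assert(reinterpret_cast<std::uintptr_t>(&b) % 16 == 0 &&
           reinterpret_cast<std::uintptr_t>(&v) % 16 == 0);

    __m128 blo = _mm_load_ps(&b.min.x);  // aligned load, can fold into cmpps
    __m128 bhi = _mm_load_ps(&b.max.x);
    __m128 vv  = _mm_load_ps(&v.x);

    blo = _mm_cmple_ps(blo, vv);         // min <= v, per element
    bhi = _mm_cmple_ps(vv, bhi);         // v <= max, per element
    int mask = _mm_movemask_ps(_mm_and_ps(blo, bhi));

    // Ignore bit 3 (the padding element); x, y and z must all pass.
    return (mask & 0b0111) == 0b0111;
}
```

With `alignas(16)` on the type, stack and static objects get the alignment automatically; objects reached through arbitrary pointers would still need the unaligned loads.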
If you invert the sense of the comparison (cmpnle), to test for being outside the bounding box on any axis, you could do something like
int mask1 = _mm_movemask_ps(blo);
int mask2 = _mm_movemask_ps(bhi);
return !(mask1 | mask2);
which might compile to
movmskps
movmskps
or
setz
So the integer test is cheaper, and you replace a vector AND with another movmsk (about equal cost).
I was thinking for a while that doing it that way would mean a NaN counted as inside the box, but actually cmpnleps is true when either operand is NaN. (cmpleps is false in that case, so it really is the exact opposite.)
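The NaN behaviour of the two predicates is easy to check directly. The helpers `le_lane0` and `nle_lane0` below are hypothetical names made up for this sketch; they compare two scalars with the packed instructions and return the result for element 0.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: result of cmpleps for element 0 (a <= b).
inline bool le_lane0(float a, float b)
{
    __m128 r = _mm_cmple_ps(_mm_set1_ps(a), _mm_set1_ps(b));
    return (_mm_movemask_ps(r) & 1) != 0;
}

// Hypothetical helper: result of cmpnleps for element 0, i.e. !(a <= b).
inline bool nle_lane0(float a, float b)
{
    __m128 r = _mm_cmpnle_ps(_mm_set1_ps(a), _mm_set1_ps(b));
    return (_mm_movemask_ps(r) & 1) != 0;
}
```

For ordered inputs the two are complements: `le_lane0(1, 2)` is true and `nle_lane0(1, 2)` is false. With a NaN in either position, `le_lane0` returns false and `nle_lane0` returns true, so a NaN coordinate is reported as outside the box by both versions of the test.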
I haven't thought through what happens to the padding in this case. It might end up being !((mask1|mask2) & 0b0111), which is still more efficient for x86, because the test instruction does an AND for free, and can macro-fuse with a branch instruction on Intel and AMD.
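Putting the inverted comparison and the padding mask together gives something like the following. This is a sketch under assumptions: the struct layouts (three floats plus one padding float) and the name `bbox_inverted` are not from the post.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Assumed layouts for illustration: x, y, z plus one padding float.
struct Vec  { float x, y, z, pad; };
struct BBox { Vec min, max; };

inline bool bbox_inverted(const BBox& b, const Vec& v)
{
    __m128 blo = _mm_loadu_ps(&b.min.x);
    __m128 bhi = _mm_loadu_ps(&b.max.x);
    __m128 vv  = _mm_loadu_ps(&v.x);

    blo = _mm_cmpnle_ps(blo, vv);  // true where !(min <= v): below the box, or NaN
    bhi = _mm_cmpnle_ps(vv, bhi);  // true where !(v <= max): above the box, or NaN

    int mask1 = _mm_movemask_ps(blo);
    int mask2 = _mm_movemask_ps(bhi);

    // Drop bit 3 so the padding element cannot report a false "outside".
    return !((mask1 | mask2) & 0b0111);
}
```

If all three padding elements hold the same finite value (zero, say), the comparison for that element is always false and the `& 0b0111` is redundant. It only matters when the padding may be uninitialized or differ between the point and the box.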
movmskps is 2 m-ops and high-latency on AMD, but using vectors is probably still a win. Two movmskps instructions might be slightly worse on AMD than the code I posted first, but the instruction is pipelined, so both can be in flight once the cmpps instructions finish.