If you don't mind also detecting NaNs, i.e. you want to check for any number that isn't finite, see @gox's answer, which suggests subtracting each element from itself (producing +0.0 in the default rounding mode for finite inputs, else NaN) and then using _mm256_movemask_epi8 to take one bit from each byte, including one from the exponent, which will be set for NaN and clear for 0.0. Testing movemask & 0x77777777 would let you ignore the sign bit, so it works even with FP rounding mode = roundTowardNegative, where x-x gives -0.0.
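A minimal sketch of that subtract-and-movemask idea, not taken from @gox's answer itself. The helper name `any_nonfinite_avx2` and the pointer-based interface are assumptions made for illustration, and the `target` attribute is a GCC/Clang extension used so the function builds without `-mavx2`.

```cpp
#include <immintrin.h>

// Hypothetical helper: true if any of the 8 floats at p is Inf or NaN.
// Assumes the code is not built with -ffast-math, which would let the
// compiler fold x - x to 0.
__attribute__((target("avx2")))
bool any_nonfinite_avx2(const float* p)
{
    __m256 v    = _mm256_loadu_ps(p);
    __m256 diff = _mm256_sub_ps(v, v);   // +0.0 or -0.0 for finite x, NaN for Inf or NaN
    int mask    = _mm256_movemask_epi8(_mm256_castps_si256(diff));
    // Bits 3, 7, 11, ... of the mask come from the top byte of each float,
    // i.e. its sign bit. Masking them off ignores -0.0 results.
    // Bit 2 of each group is the low exponent bit, which is set for a NaN.
    return (mask & 0x77777777) != 0;
}
```

The caller is expected to have checked for AVX2 support before calling this.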
If you need to detect infinity specifically, not also NaN
AVX-512DQ+VL has _mm256_fpclass_ps_mask (vfpclassps), whose mask result you can test with _kortestz_mask16_u8. But without AVX-512, it might be most efficient to use AVX2 integer instructions on the bit-pattern.
The IEEE binary32 bit-pattern for infinity is an all-ones exponent field and an all-zero mantissa, and the sign bit indicates whether it's + or - infinity. (NaN has the same exponent but a non-zero mantissa.) So there are 2 bit-patterns we want to detect, which differ only in the high bit.
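As a scalar reference for the bit-pattern described above, here is a small sketch. The function name `is_inf_bits` is hypothetical, and it assumes `float` is IEEE binary32, as it is on x86.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar reference: true only for +Inf or -Inf, not for NaN.
bool is_inf_bits(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // well-defined way to read the bit-pattern
    // +Inf is 0x7f800000 and -Inf is 0xff800000. Shifting left by 1 drops the
    // sign bit, so both become 0xff000000. A NaN has a non-zero mantissa, so
    // it compares unequal.
    return (bits << 1) == 0xff000000u;
}
```

The vector versions apply this same shift-and-compare to every 32-bit element at once.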
We can do this using AVX2 integer shift + cmpeq operations with only one vector constant, with lower latency than vcmpps even accounting for the bypass latency if the input came from an FP math instruction. There's also a potential throughput benefit, as vpslld and/or vpcmpeqd can run on different ports than FP math/compare instructions on some CPUs. (Using a bitwise AND, ANDN, or OR to force the sign bit to a known state, clear or set, could further help with bypass latency on some CPUs, and be even better for throughput, since it can execute on a wider choice of back-end execution units on more CPUs.) See https://uops.info/ and https://agner.org/optimize/ for per-CPU latency and port numbers.
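The parenthetical about forcing the sign bit to a known state could look something like the sketch below, which clears it with ANDN and then compares against +Inf. The name `inf_mask_andn_avx2` and the pointer interface are assumptions for illustration. Note that this form needs two vector constants (the sign mask and the +Inf pattern) rather than one.

```cpp
#include <immintrin.h>

// Hypothetical variant: clear the sign bit with ANDN instead of shifting it out.
// Returns a bitmask with bit i set if element i is +Inf or -Inf.
// The target attribute is a GCC/Clang extension so this builds without -mavx2.
__attribute__((target("avx2")))
int inf_mask_andn_avx2(const float* p)
{
    __m256 v       = _mm256_loadu_ps(p);
    __m256 absv    = _mm256_andnot_ps(_mm256_set1_ps(-0.0f), v);   // (~sign_bit) & v
    __m256i is_inf = _mm256_cmpeq_epi32(_mm256_castps_si256(absv),
                                        _mm256_set1_epi32(0x7f800000));  // +Inf bit-pattern
    return _mm256_movemask_ps(_mm256_castsi256_ps(is_inf));        // one bit per element
}
```

Whether this beats the shift version depends on the microarchitecture, so it is worth checking the instruction tables for the CPUs you care about.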
You could do this with integer operations, like a left-shift by 1 to remove the sign bit, then _mm256_cmpeq_epi32 against set1_epi32(0xff000000) (the bit pattern for infinity, left-shifted by 1: all bits set in the exponent, all bits clear in the mantissa, otherwise it's a NaN). Then you'd only need one constant, and the lower latency of integer compare should make up for the possible bypass latency.
#include <immintrin.h>

int has_infinity_avx2(__m256 v)
{
    __m256i bits = _mm256_castps_si256(v);
    bits = _mm256_slli_epi32(bits, 1);   // shift out sign bits.  Requires AVX2
    bits = _mm256_cmpeq_epi32(bits, _mm256_set1_epi32(0xff000000));  // infinity << 1
    return _mm256_movemask_epi8(bits);   // non-zero if any element is +Inf or -Inf
    // or cast for _mm256_movemask_ps if you want to std::countr_zero to find out where in terms of elements instead of byte offsets
}
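To illustrate the comment about _mm256_movemask_ps, here is a sketch that returns the element index of the first infinity. The name `first_infinity_index_avx2` and the pointer interface are assumptions, and `__builtin_ctz` (a GCC/Clang builtin) stands in for C++20's std::countr_zero so the example doesn't require C++20.

```cpp
#include <immintrin.h>

// Hypothetical helper: index (0..7) of the first +Inf or -Inf among the
// 8 floats at p, or -1 if there is none.
// The target attribute is a GCC/Clang extension so this builds without -mavx2.
__attribute__((target("avx2")))
int first_infinity_index_avx2(const float* p)
{
    __m256i bits = _mm256_castps_si256(_mm256_loadu_ps(p));
    bits = _mm256_slli_epi32(bits, 1);                        // shift out sign bits
    bits = _mm256_cmpeq_epi32(bits,
               _mm256_set1_epi32(static_cast<int>(0xff000000u)));  // infinity << 1
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(bits)); // one bit per float
    return mask ? __builtin_ctz(mask) : -1;                   // lowest set bit = first match
}
```

With _mm256_movemask_epi8 instead, the same bit-scan would give a byte offset (4 times the element index).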
I had an earlier idea, but it ends up only helping if you want to test for ALL elements being infinite. Oops.
With AVX2, you can test for all elements being infinity with PTEST. I got this idea for using xor to compare for equality from EOF's comment on this question, which I used for my answer there. I thought I was going to be able to make a shorter version of a test-for-any-inf, but of course pxor only works as a test for all 256b being equal.
#include <immintrin.h>
#include <limits>

bool all_infinity(__m256 x){
    const __m256i SIGN_MASK = _mm256_set1_epi32(0x7FFFFFFF);  // -0.0f inverted
    const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());
    x = _mm256_xor_ps(x, INF);  // other than sign bit, x will be all-zero only if all the bits match.
    // vptest needs an integer vector, so cast (free) before testing
    return _mm256_testz_si256(_mm256_castps_si256(x), SIGN_MASK);  // flags are ready to branch on directly
}
With AVX512 (specifically AVX-512DQ), there's __mmask8 _mm512_fpclass_pd_mask (__m512d a, int imm8) (vfpclasspd). (See Intel's guide.) Its output is a mask register, which you can test and branch on. You can test for any/all of +/- zero, +/- inf, Q/S NaN, Denormal, Negative.
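A sketch of the fpclass approach for the 8-float case used elsewhere in this answer, using the 256-bit single-precision form (_mm256_fpclass_ps_mask, which needs AVX-512DQ + AVX-512VL). The name `has_infinity_avx512` and the pointer interface are assumptions, and the caller is assumed to have verified CPU support first.

```cpp
#include <immintrin.h>

// Hypothetical helper: true if any of the 8 floats at p is +Inf or -Inf.
// Assumes a CPU with AVX-512DQ and AVX-512VL; calling it elsewhere will fault.
// The target attribute is a GCC/Clang extension so this builds without -mavx512dq.
__attribute__((target("avx512dq,avx512vl")))
bool has_infinity_avx512(const float* p)
{
    __m256 v = _mm256_loadu_ps(p);
    // imm8 selects the categories to match: 0x08 = +Inf, 0x10 = -Inf
    __mmask8 k = _mm256_fpclass_ps_mask(v, 0x08 | 0x10);
    return k != 0;   // one mask bit per element
}
```

Other imm8 bits select the remaining categories (zeros, NaNs, denormals, negative finite values), so the same call shape covers those tests too.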