3

What is the best way to check whether a AVX intrinsic __m256 (vector of 8 float) contains any inf? I tried

__m256 X=_mm256_set1_ps(1.0f/0.0f);
_mm256_cmp_ps(X,X,_CMP_EQ_OQ);

but this compares to true. Note that this method will find nan (which compare to false). So one way is to check for X!=nan && 0*X==nan:

__m256 Y=_mm256_mul_ps(X,_mm256_setzero_ps());   // 0*X=nan if X=inf
_mm256_andnot_ps(_mm256_cmp_ps(Y,Y,_CMP_EQ_OQ),
                 _mm256_cmp_ps(X,X,_CMP_EQ_OQ));

However, this appears somewhat lengthy. Is there a faster way?

Walter
  • 44,150
  • 20
  • 113
  • 196

4 Answers4

7

If you want to check if a vector has any infinities:

#include <limits>

bool has_infinity(__m256 x){
    const __m256 SIGN_MASK = _mm256_set1_ps(-0.0);
    const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());

    x = _mm256_andnot_ps(SIGN_MASK, x);
    x = _mm256_cmp_ps(x, INF, _CMP_EQ_OQ);
    return _mm256_movemask_ps(x) != 0;
}

If you want a vector mask of the values that are infinity:

#include <limits>

__m256 is_infinity(__m256 x){
    const __m256 SIGN_MASK = _mm256_set1_ps(-0.0);
    const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());

    x = _mm256_andnot_ps(SIGN_MASK, x);
    x = _mm256_cmp_ps(x, INF, _CMP_EQ_OQ);
    return x;
}
Mysticial
  • 464,885
  • 45
  • 335
  • 332
  • Your `has_infinity` function seems like the old way of doing things before `ptest`. `vmovmskps` also requires `test` whereas `vptest` sets the `RFLAGS` register and does not need `test`. – Z boson Jul 03 '15 at 07:40
  • Based on Peter Comment in my answer it seems `ptest` is not a big advantage to `movmsk` anyway. – Z boson Jul 06 '15 at 09:04
  • This [thread](https://software.intel.com/en-us/forums/intel-isa-extensions/topic/293050) from this [answer](http://stackoverflow.com/a/32637106/2542702) may interest you. – Z boson Sep 23 '15 at 10:05
  • You could do this with integer operations, like left-shift by 1 to remove the sign bit, then `_mm256_cmpeq_epi32` against `set1_epi32(0xff000000)` (the bit pattern for infinity, left-shifted by 1. All bits set in the exponent, all bits clear in the mantissa, otherwise it's a NaN). Then you'd only need one constant, and the lower latency of integer compare should make up for the possible bypass latency. – Peter Cordes Aug 20 '22 at 01:11
2

I think a better solution is to use vptest rather than vmovmskps.

bool has_infinity(const __m256 &x) {
    __m256 s   = _mm256_andnot_ps(_mm256_set1_ps(-0.0), x);
    __m256 cmp = _mm256_cmp_ps(s,_mm256_set1_ps(1.0f/0.0f),0);
    __m256i cmpi = _mm256_castps_si256(cmp);
    return !_mm256_testz_si256(cmpi,cmpi);
}

The intrinsic _mm256_castps_si256 is only to make the compiler happy "This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency."

vptest is superior to vmovmskps because it sets the zero flag while vmovmskps does not. With vmovmskps the compiler has to generate test to set the zero flag.

Z boson
  • 32,619
  • 11
  • 123
  • 226
  • `movmsk` vs. `ptest` is not as obvious as you'd think. `test` can fuse with a `jcc`, and `ptest` is 2 uops. I think `ptest` is lower latency, though. – Peter Cordes Jul 03 '15 at 15:27
  • @PeterCordes, good points! I guess that's why Paul R found my solution using `ptest` to be only a bit faster than `movmsk` rather than a lot faster. – Z boson Jul 06 '15 at 09:02
1

If you don't mind also detecting NaNs, i.e. to check for numbers that aren't finite, see @gox's answer suggesting subtraction from itself (producing +0.0 in the default rounding mode for finite inputs, else NaN) and then using _mm256_movemask_epi8 to take one bit from each byte, including one from the exponent which will be non-zero for NaNs, or zero for 0.0. Testing movemask & 0x77777777 would let you ignore the sign bit so it works even with FP rounding mode = roundTowardNegative where x-x gives -0.0


If you need to detect infinity specifically, not also NaN

AVX-512F+VL has _mm256_fpclass_ps_mask + _kortestz_mask16_u8. But without AVX-512, it might be most efficient to use AVX2 integer stuff on the bit-pattern.

The IEEE binary32 bit-pattern for infinity is an all-ones exponent field and an all-zero mantissa. And the sign bit indicates whether it's + or - infinity. (NaN is the same exponent but a non-zero mantissa) So there are 2 bit-patterns we want to detect, which differ only in the high bit.

We can do this using AVX2 integer shift + cmpeq operations with only one vector constant, with lower latency than vcmpps even accounting for the bypass latency if the input came from an FP math instruction. And potentially a throughput benefit, as vpslld and/or vpcmpeqd can run on different ports than FP math/compare instructions on some CPUs. (Using a bitwise AND, ANDN, or OR to force the sign bit to a known state, clear or set, could further help with bypass latency on some CPUs, and be even better for throughput, able to execute on a wider choice of back-end execution units on more CPUs.)

(https://uops.info/ / https://agner.org/optimize/)

You could do this with integer operations, like left-shift by 1 to remove the sign bit, then _mm256_cmpeq_epi32 against set1_epi32(0xff000000) (the bit pattern for infinity, left-shifted by 1. All bits set in the exponent, all bits clear in the mantissa, otherwise it's a NaN). Then you'd only need one constant, and the lower latency of integer compare should make up for the possible bypass latency.

int has_infinity_avx2(__m256 v)
{
    __m256i bits = _mm256_castps_si256(v);
    bits = _mm256_slli_epi32(bits, 1);  // shift out sign bits.  Requires AVX2
    bits = _mm256_cmpeq_epi32(bits, _mm256_set1_epi32(0xff000000));  // infinity << 1
    return _mm256_movemask_epi8(bits);
  // or cast for _mm256_movemask_ps if you want to std::countr_zero to find out where in terms of elements instead of byte offsets
}

I had an earlier idea, but it ends up only helping if you want to test for ALL elements being infinite. Oops.

With AVX2, you can test for all elements being infinity with PTEST. I got this idea for using xor to compare for equality from EOF's comment on this question, which I used for my answer there. I thought I was going to be able to make a shorter version of a test-for-any-inf, but of course pxor only works as a test for all 256b being equal.

#include <limits>

bool all_infinity(__m256 x){
    const __m256i SIGN_MASK = _mm256_set1_epi32(0x7FFFFFFF);  // -0.0f inverted
    const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());

    x = _mm256_xor_si256(x, INF);  // other than sign bit, x will be all-zero only if all the bits match.
    return _mm256_testz_si256(x, SIGN_MASK); // flags are ready to branch on directly
}

With AVX512, there's a __mmask8 _mm512_fpclass_pd_mask (__m512d a, int imm8). (vfpclasspd). (See Intel's guide). Its output is a mask register, which you can branch on directly. You can test for any/all of +/- zero, +/- inf, Q/S NaN, Denormal, Negative.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • I did something similar here [sse-testing-equality-between-two-m128i-variables](http://stackoverflow.com/questions/26880863/sse-testing-equality-between-two-m128i-variables/26883316#26883316). – Z boson Jul 03 '15 at 07:41
  • I think you can come up with a `has_infinity` using `vptest`. – Z boson Jul 03 '15 at 07:44
  • By itself? I don't think so, because all you could do is test if any of the bit-pattern that means `inf` is present. What you need is to test that ALL the bits are set. I think you need something like `pcmpeq` or `cmpps` to turn a matches-exact-bit-pattern into an all-set or none-set, if you want your condition to be true when one element matches your pattern, but other elements can be anything. – Peter Cordes Jul 03 '15 at 15:22
1

This short one tests whether any of floats in a vector are corrupted (NAN or INFINITY) or not:

int is_corrupted( const __m256 & float_v8 ) {
    __m256 self_sub_v8  = _mm256_sub_ps( float_v8, float_v8 );

    return _mm256_movemask_epi8( _mm256_castps_si256( self_sub_v8 ) );
}

It's 2 AVX2 instructions only without additional constants, and uses a trick -- any normal "self"-subtraction should end as a zero, so a movemask_epi8 later should extract some of its bits indicating whether it's a zero or a NAN/INFINITY. I haven't tested it on different platforms.

Edit: see Peter's important comments on rounding toward negative.

gox
  • 39
  • 6
  • Oh interesting, yeah by using an `epi8` movemask, you're getting bits from the interior bytes of the floats, not just the sign bits. One of those bits is the 2nd bit of the exponent field https://en.wikipedia.org/wiki/Single-precision_floating-point_format#IEEE_754_standard:_binary32 (which is all-ones for inf/nan). And yes, `x-x` is required to produce `+0.0` for finite inputs, all-zero bit-pattern, not `-0.0` (which would have the sign bit set). – Peter Cordes Aug 19 '22 at 06:27
  • [What is (+0)+(-0) by IEEE floating point standard?](https://stackoverflow.com/q/28949774) includes a quote from the IEEE standard for the case where *the difference of two operands with like signs ... is exactly zero* - **the result shall be +0 except when the current rounding mode is `roundTowardNegative`**. In a program that sometimes sets the rounding mode to floor rounding, or a careful library, you could test `movemask & 0x77777777` so you ignore the sign bits. x86 can do this about as cheaply as testing for zero, e.g. `test eax, 0x77777777`/`jnz` instead of `test eax,eax`/`jnz` – Peter Cordes Aug 19 '22 at 06:29
  • And yeah, `Inf-Inf` is NaN, unlike using `_mm256_cmpeq_ps` - only NaN isn't equal to itself. Inf == Inf is true. – Peter Cordes Aug 19 '22 at 06:39
  • Exactly, you are right about the rounding, I didn't take that into account. – gox Aug 19 '22 at 21:44
  • I was only reminded of it when checking the standard to confirm that you're guaranteed to get +0. I think it's very rare that code would set a non-default rounding mode so it's not much of a downside to this idea, although it is worth mentioning. – Peter Cordes Aug 19 '22 at 21:46
  • Feel free to quote or paraphrase any of my comments in your answer directly. – Peter Cordes Aug 20 '22 at 00:26
  • @PeterCordes Definitely, it's mentioned now. However, what I am a bit concerned about is whether all implementations of SUB in hardware used on the same vector register round the result to exactly zero, or they could introduce some small "rounding errors" -- see representations of the smallest normal and subnormal numbers [here](https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Notable_single-precision_cases). In such a (hypothetical ?) case, probably the best way would be to insert an additional shift like `_mm256_srli_epi32( self_sub_v8, 7 )` in the code above. – gox Aug 20 '22 at 00:55
  • SSE/AVX math follows IEEE rules: basic operations (+ - * / and sqrt) are required to produce a "correctly rounded" result: error <= 0.5ulp of the exact result, i.e. all bits of the mantissa must be correct, even the least significant. The mathematical result of subtracting a finite number from itself is 0, so that's what IEEE FP subtraction must produce for any finite input. (Otherwise NaN). – Peter Cordes Aug 20 '22 at 01:04
  • IEEE FP math isn't just arbitrarily wrong (that's one of the huge benefits of the standard), you only get rounding error when the exact result can't be expressed as a float or double. (Except for math library functions like exp/log/sin/cos and so on, which are not "basic" operations, and are numerical algorithms implemented with multiple basic operations.) – Peter Cordes Aug 20 '22 at 01:07