What causes the different NaN behavior when compiling `_mm_ucomilt_ss` intrinsic?

Question

Can someone explain me why the following code fails for GCC 8.5 with NaNs?

bool isfinite_sse42(float num)
{    
    return _mm_ucomilt_ss(_mm_set_ss(std::abs(num)),
                          _mm_set_ss(std::numeric_limits<float>::infinity())) == 1;
}

My expectation for GCC 8.5 would be to return false.

The Intel Intrinsics guide for _mm_ucomilt_ss says

RETURN ( a[31:0] != NaN AND b[31:0] != NaN AND a[31:0] == b[31:0] ) ? 1 : 0

i.e., if either a or b is NaN it returns 0. On assembly level (Godbolt) one can see a ucomiss abs(x), Infinity followed by a setb.

# GCC8.5 -O2  doesn't match documented intrinsic behaviour for NaN
        ucomiss xmm0, DWORD PTR .LC2[rip]
        setb    al

Interestingly newer GCCs and Clang swap the comparison from a < b to b > a and therefore use seta. But why does the code with setb returns true for NaN and why seta returns false for NaN?

`setb`/`seta` only look at the carry flag which is set to `1` for unordered. — Jester, Jun 01 '23 at 13:44
re: your example on Godbolt: `x < INFINITY` will be true for `-INFINITY`. That's why `isfinite_builtin` uses `andps` to take the absolute value first. — Peter Cordes, Jun 01 '23 at 14:05
@PeterCordes, I removed the `abs` code beforehand. I changed my code example. But the problem still persists. — jerryct, Jun 01 '23 at 14:47
@Jester, I sadly don't get the difference. `seta` gives the correct result but `setb` not. Can you please further explain the difference? From the intrinsics description it should return 0 when either operand is NaN. But this is not the case. Why? When looking at the assembly I just saw the difference `seta` vs. `setb` but cannot explain what is happening (maybe it is even unrelated to the issue?). — jerryct, Jun 01 '23 at 14:49
Yeah, GCC8.5 seems to be buggy, not implementing the documented semantics of the intrinsic for the NaN case which require either checking PF separately, or doing it as `ucomiss Inf, abs` so the unordered case sets CF the same way as `abs < Inf`. See https://www.felixcloutier.com/x86/ucomiss#operation or the nicer table in https://www.felixcloutier.com/x86/fcomi:fcomip:fucomi:fucomip — Peter Cordes, Jun 01 '23 at 14:56

Peter Cordes · Accepted Answer · 2023-06-01T16:00:09.297

GCC is buggy before GCC13, not implementing the documented semantics of the intrinsic for the NaN case which require either checking PF separately, or doing it as ucomiss Inf, abs so the unordered case sets CF the same way as abs < Inf.

See https://www.felixcloutier.com/x86/ucomiss#operation or the nicer table in https://www.felixcloutier.com/x86/fcomi:fcomip:fucomi:fucomip . (All x86 scalar FP compares that set EFLAGS do it the same way, matching historical fcom / fstsw / sahf.)

Comparison Results	ZF	PF	CF
left > right	0	0	0
left < right	0	0	1
left = right	1	0	0
Unordered	1	1	1

Notice that CF is set for both the left < right and unordered cases, but not for the other two cases.

If you can arrange things such that you can check for > or >=, you don't need to setnp cl / and al, cl to rule out Unordered. This is what clang 16 and GCC 13 do to get correct results from ucomiss inf, abs / seta.

GCC8.5 does the right thing if you write abs(x) < infinity, it's only the scalar intrinsic that it doesn't implement properly. (With plain scalar code, it uses comiss instead of ucomiss, the only difference being that it will update the FP environment with a #I FP-exception on QNaN as well as SNaN.)

This requires a separate movss load instead of a memory source. But this does let GCC avoid the useless SSE4.1 insertps instruction that zeros the high 3 elements of XMM0, which ucomiss doesn't read anyway. Clang sees that and optimizes away that part of _mm_set_ss(num) but GCC doesn't. The lack of an efficient way to go from a scalar float to a __m128 with don't-care upper elements is a persistent problem in Intel's intrinsics API that only some compilers manage to optimize around. (How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?) A float is just the low element of a __m128.

Thanks. Just a side note: GCC has the wrong behavior up to GCC12 https://godbolt.org/z/5e6914jYb — jerryct, Jun 01 '23 at 15:42
And `_mm_ucomieq_ss` has the same issue: https://godbolt.org/z/zx1T6z7vT. This time `sete` is used which tests the ZF flag. GCC13 inserts the `jp` instruction as you said "either checking PF separately". But as you also said there is a better solution by checking with `>` or `>=` which is done by the builtin. — jerryct, Jun 01 '23 at 16:18
@jerryct: Yeah that's unsurprising, it's the same bug affecting other FP compare predicates. Interesting point that we can optimize the source to `abs >= inf` to allow `seta`, instead of `setnp` / `cmovne` (scalar `==`) or a branch (intrinsic). https://godbolt.org/z/3dbGso5n1 . GCC misses that optimization, but it's safe because nothing can ever be larger than inf so we only get the == part. In asm it allows `seta` to check ZF and CF in one uop. — Peter Cordes, Jun 01 '23 at 16:33

What causes the different NaN behavior when compiling `_mm_ucomilt_ss` intrinsic?

1 Answers1

Linked