C++ compilers give different signs of NaN for constant propagation of subtracting +-Infinity or +-NaN from itself in AVX SIMD code

Question

I'm investigating how to detect in which lanes of a SIMD register the floats are either +/- infinity or +/- NaN. After seeing some weird behavior at runtime, I decided to throw things into Godbolt to investigate, and things are weird: https://godbolt.org/z/TdnrK8rqd

#include <immintrin.h>
#include <cstdio>
#include <limits>
#include <cstdint>

static constexpr float inf = std::numeric_limits<float>::infinity();
static constexpr float qnan = std::numeric_limits<float>::quiet_NaN();
static constexpr float snan = std::numeric_limits<float>::signaling_NaN();

int main() {
    __m256 a = _mm256_setr_ps(0.0f, 1.0f, inf, -inf, qnan, -qnan, snan, -snan);

    __m256 mask = _mm256_sub_ps(a, a);

    // Extract masks as integers 
    int mask_bits = _mm256_movemask_ps(mask);

    std::printf("Mask for INFINITY or NaN: 0x%x\n", mask_bits);

    #define PRINT_ALL
    #ifdef PRINT_ALL
    float data_field[8];
    float mask_field[8];
    _mm256_storeu_ps(data_field, a);
    _mm256_storeu_ps(mask_field, mask);
    for (int i = 0; i < 8; ++i) {
        std::printf("isfinite(%f) = %x = %f\n", data_field[i], ((int32_t*)(char*)mask_field)[i], mask_field[i]);
    }
    #endif
    
    return 0;
}

Compilers give different results, and the results even differ depending on the optimization level. Some compilers fully execute the code at compile time with (broken?) reasoning, so it all compiles down to some hard-coded print statements without any actual calculation at runtime. Changing the optimization level causes some compilers to trigger these (incorrect?) optimizations.

Additionally, printing out all results manually (the PRINT_ALL option) seems to influence what happens.

The printed mask differs widely:

Without PRINT_ALL:
- GCC 13.1 -O0: 0xac - new NaNs are -nan; preserve sign of input NaN.
- GCC 13.1 -O1: 0x5c - new NaNs are -nan, flip sign of input NaNs.
- Clang 16.0.0 -O0: 0xac
- Clang 16.0.0 -O1: 0xa0 - new NaNs are +nan; preserve sign of input NaN.
- ICX 2022.2.1 -O0: 0xac
- ICX 2022.2.1 -O1: 0xa0
With PRINT_ALL, optimized GCC now matches what the hardware does; LLVM (clang and ICX) doesn't change:
- GCC 13.1 -O0: 0xac
- GCC 13.1 -O1: 0xac
- Clang 16.0.0 -O0: 0xac
- Clang 16.0.0 -O1: 0xa0
- ICX 2022.2.1 -O0: 0xac
- ICX 2022.2.1 -O1: 0xa0

A "new NaN" is inf - inf or -inf - -inf, where the result is NaN but neither input was NaN. These form the high 2 bits of the low hex digit, the 0x?C or 0x?0. The low 2 bits of that nibble come from the 0-0 and 1-1 elements, which produce +0.0 output as required for finite same-same with rounding modes other than towards -Inf (which isn't the default.)
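To make the bit layout described above easier to check, here is a minimal sketch that decodes a mask value into per-lane sign bits. The helper names `lane_signs` and `print_lane_signs` are hypothetical, and the lane labels simply follow the element order passed to `_mm256_setr_ps` in the question.

```cpp
#include <array>
#include <cstdio>

// Bit i of the _mm256_movemask_ps result is the sign bit of lane i.
std::array<bool, 8> lane_signs(int mask_bits) {
    std::array<bool, 8> negative{};
    for (int i = 0; i < 8; ++i) {
        negative[i] = ((mask_bits >> i) & 1) != 0;
    }
    return negative;
}

// Print, for each lane of a - a, whether the sign bit of the result is set.
void print_lane_signs(int mask_bits) {
    static const char* const lane_names[8] = {
        "0 - 0",       "1 - 1",
        "inf - inf",   "(-inf) - (-inf)",
        "qnan - qnan", "(-qnan) - (-qnan)",
        "snan - snan", "(-snan) - (-snan)"};
    const std::array<bool, 8> negative = lane_signs(mask_bits);
    for (int i = 0; i < 8; ++i) {
        std::printf("lane %d (%s): sign bit %s\n", i, lane_names[i],
                    negative[i] ? "set" : "clear");
    }
}
```

For example, `print_lane_signs(0xac)` reports the sign bit as set in lanes 2, 3, 5 and 7, which matches "new NaNs are -nan; preserve sign of input NaN".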

ICX and clang seem to agree with each other, but still differ in results depending on the optimization level. I'm guessing 0xac is the correct result, as that is what happens in -O0 and all results are actually calculated by the CPU at runtime, without the compiler trying to be clever.

Bottom line, my question is, is this "expected behavior" according to some rules I am not aware of, or did I find a bug in three different compilers (GCC, Clang, and ICX)? (I couldn't test MSVC, as Godbolt doesn't support executing the code for those builds.)

(-fno-strict-aliasing doesn't affect the results, so ((int32_t*)(char*)mask_field)[i] wasn't causing this.)

Pointing `(int32_t*)` at a `__m256` object is strict-aliasing UB. Try with a safe way of printing the bit-patterns as integers, like store to an `int32_t` array, as in [print a \_\_m128i variable](https://stackoverflow.com/q/13257166). Or just try `gcc -fno-strict-aliasing`. The `0xac` mask result is separate from that, though, so probably what you're seeing is separate. — Peter Cordes, May 07 '23 at 14:18
@PeterCordes: Good Point! Although I did realize that, and added the implicit cast to `(char*)` in the middle, which should satisfy the strict aliasing rules, AFAIK? Or am I mistaken? — Martijn Courteaux, May 07 '23 at 14:23
You're mistaken. Casting through `char*` doesn't sidestep strict-aliasing. It's the type of the actual dereference that has to match the object. `char*` would only help if you were doing `memcpy(&tmp, ((char*)mask_field) + i * sizeof(tmp), sizeof(tmp));` or actual `char*` derefs. — Peter Cordes, May 07 '23 at 14:25
ICX uses the same LLVM back-end as clang so it's unsurprising they agree. Although if ICX is like ICC, it might default to `-ffast-math` which might include assuming finite floats. (ICC defaults to `-ffp-model:fast` which is slightly less aggressive than gcc/clang -ffast-math.) — Peter Cordes, May 07 '23 at 14:29
https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf `When either an input or result is NaN, this standard does not interpret the sign of a NaN` I would say the standard does not specify the sign of the result, so it is unspecified, so both signs are ok. — KamilCuk, May 07 '23 at 15:05
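The `memcpy` approach suggested in the comments above could look roughly like the following. This is a minimal sketch, not code from the thread: `float_bits` and `print_lanes` are hypothetical helper names, and it assumes `float` is a 32-bit IEEE 754 binary32 value.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Copy the object representation of a float into an integer.
// memcpy is the well-defined way to do this; no pointer cast is dereferenced.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "this sketch assumes a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Print each lane of an array that was filled with _mm256_storeu_ps,
// both as a raw bit pattern and as a float.
void print_lanes(const float* lanes, int count) {
    for (int i = 0; i < count; ++i) {
        std::printf("lane %d: 0x%08x = %f\n", i,
                    static_cast<unsigned>(float_bits(lanes[i])), lanes[i]);
    }
}
```

In the program from the question, `print_lanes(mask_field, 8)` would stand in for the loop that indexes through `(int32_t*)(char*)mask_field`.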

amonakov · Accepted Answer · 2023-05-07T15:50:03.463

Sign of a NaN result is specified only for abs, copysign, and unary minus. Otherwise, the sign is unspecified. When both operands of an SSE instruction are not NaN, x86 CPUs produce a negative NaN, but compilers are not obligated to simulate that when optimizing.

Therefore only the low two bits of the mask_bits variable are predictable.

For instance, in the case of gcc -O1, the compiler internally transforms _mm256_sub_ps(a, a) to a + b, where b is a constant vector with the same contents as a, but all signs flipped. After that it emits the vaddps instruction with those constant vectors in registers, and the bits in the high nibble of the result depend on the order of operands (the CPU copies the NaN from one of the operands).
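The difference between the two forms can be observed at runtime with a scalar sketch like the one below. It rests on several assumptions: `sub_self`, `add_negated` and `report` are hypothetical helpers, the `volatile` locals are only there to discourage compile-time folding, and scalar SSE arithmetic is assumed to follow the same NaN propagation behavior as `vaddps`. The compiler is free to swap the operands of the addition, so the printed signs may differ between compilers and CPUs.

```cpp
#include <cmath>
#include <cstdio>

// Compute x - x at runtime.
float sub_self(float x) {
    volatile float v = x;
    return v - v;
}

// Compute (-x) + x at runtime, mirroring the a + b form described above.
float add_negated(float x) {
    volatile float v = x;
    volatile float n = -x;
    return n + v;
}

// Print the sign of both results for one input; no particular output is promised.
void report(const char* label, float x) {
    const float s = sub_self(x);
    const float a = add_negated(x);
    std::printf("%s: x - x -> %s%f, (-x) + x -> %s%f\n", label,
                std::signbit(s) ? "negative " : "positive ", s,
                std::signbit(a) ? "negative " : "positive ", a);
}
```

Both functions return a NaN for infinite or NaN inputs; only the sign of that NaN is left open.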

LLVM folds subtraction of infinities to positive NaN, where the CPU produces a negative NaN: https://godbolt.org/z/1hd69josr

Thanks for the very insightful reply. I could confirm that indeed, the IEEE standard does not specify -NaN. It just says that there should be "at least one NaN", and they never specify anything about what the sign of it should be. — Martijn Courteaux, May 08 '23 at 11:49
