I'm investigating how to detect which lanes of a SIMD register hold floats that are +/- infinity or +/- NaN. After seeing some weird behavior at runtime, I threw the code into Godbolt to investigate, and the results are inconsistent: https://godbolt.org/z/TdnrK8rqd
#include <immintrin.h>
#include <cstdio>
#include <limits>
#include <cstdint>
static constexpr float inf = std::numeric_limits<float>::infinity();
static constexpr float qnan = std::numeric_limits<float>::quiet_NaN();
static constexpr float snan = std::numeric_limits<float>::signaling_NaN();
int main() {
    __m256 a = _mm256_setr_ps(0.0f, 1.0f, inf, -inf, qnan, -qnan, snan, -snan);
    __m256 mask = _mm256_sub_ps(a, a);
    // Extract masks as integers
    int mask_bits = _mm256_movemask_ps(mask);
    std::printf("Mask for INFINITY or NaN: 0x%x\n", mask_bits);
#define PRINT_ALL
#ifdef PRINT_ALL
    float data_field[8];
    float mask_field[8];
    _mm256_storeu_ps(data_field, a);
    _mm256_storeu_ps(mask_field, mask);
    for (int i = 0; i < 8; ++i) {
        std::printf("isfinite(%f) = %x = %f\n", data_field[i], ((int32_t*)(char*)mask_field)[i], mask_field[i]);
    }
#endif
    return 0;
}
Compilers give different results, and even produce different results depending on the optimization level. Some compilers fully evaluate the code at compile time with (broken?) reasoning, so it all compiles down to hard-coded print statements with no actual calculation at runtime. Changing the optimization level seems to trigger this (incorrect?) optimization in some compilers.
Additionally, it seems I can influence what happens by printing out all the results manually (the PRINT_ALL option).
The printed mask differs widely:
- Without PRINT_ALL:
  - GCC 13.1 -O0: 0xac - new NaNs are -nan; preserve sign of input NaN.
  - GCC 13.1 -O1: 0x5c - new NaNs are -nan, flip sign of input NaNs.
  - Clang 16.0.0 -O0: 0xac
  - Clang 16.0.0 -O1: 0xa0 - new NaNs are +nan; preserve sign of input NaN.
  - ICX 2022.2.1 -O0: 0xac
  - ICX 2022.2.1 -O1: 0xa0
- With PRINT_ALL, optimized GCC now matches what the hardware does, LLVM (clang and ICX) doesn't change.
  - GCC 13.1 -O0: 0xac
  - GCC 13.1 -O1: 0xac
  - Clang 16.0.0 -O0: 0xac
  - Clang 16.0.0 -O1: 0xa0
  - ICX 2022.2.1 -O0: 0xac
  - ICX 2022.2.1 -O1: 0xa0
A "new NaN" is inf - inf or -inf - -inf, where the result is NaN but neither input was NaN. These form the high 2 bits of the low hex digit, the 0x?C or 0x?0. The low 2 bits of that nibble come from the 0-0 and 1-1 elements, which produce +0.0 output as required for finite same-same with rounding modes other than towards -Inf (which isn't the default.)
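To make the bit-to-lane mapping described above concrete, here is a small sketch that decodes a printed mask back into per-lane sign bits. The helper names (decode_sign_mask, print_sign_mask, kLaneLabels) are hypothetical and not part of the original program; the only assumption is the lane order used by _mm256_setr_ps in the code above, where bit 0 of the mask corresponds to the first argument.

```cpp
#include <array>
#include <cassert>
#include <cstdio>

// Lane labels in _mm256_setr_ps order, as in the question: lane 0 is the first argument.
constexpr const char* kLaneLabels[8] = {
    "0.0", "1.0", "inf", "-inf", "qnan", "-qnan", "snan", "-snan"};

// Hypothetical helper: splits an 8-bit movemask value into per-lane sign bits.
inline std::array<bool, 8> decode_sign_mask(unsigned mask) {
    assert(mask <= 0xffu);  // _mm256_movemask_ps only yields 8 bits
    std::array<bool, 8> sign{};
    for (int lane = 0; lane < 8; ++lane) {
        sign[lane] = ((mask >> lane) & 1u) != 0;
    }
    return sign;
}

// Prints one line per lane: the input label and the sign bit of (x - x).
inline void print_sign_mask(unsigned mask) {
    const std::array<bool, 8> sign = decode_sign_mask(mask);
    for (int lane = 0; lane < 8; ++lane) {
        std::printf("lane %d (%5s): sign bit %d\n",
                    lane, kLaneLabels[lane], sign[lane] ? 1 : 0);
    }
}
```

Decoded this way, 0xac sets lanes 2, 3, 5 and 7; 0x5c sets lanes 2, 3, 4 and 6; and 0xa0 sets only lanes 5 and 7.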
ICX and clang seem to agree with each other, but still differ in results depending on the optimization level. I'm guessing 0xac is the correct result, as that is what happens at -O0, where all results are actually calculated by the CPU at runtime, without the compiler trying to be clever.
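One way to see what the hardware itself produces, even in an optimized build, might be to route the operand through a volatile so the subtraction cannot be folded at compile time. The following is only a scalar sketch under that assumption: runtime_self_subtract and report are hypothetical helpers, and they check a single float rather than a whole __m256.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <limits>

// Hypothetical helper: computes x - x at runtime.
// The two volatile reads keep the compiler from assuming both operands are
// the same known constant, so the subtraction is left to the CPU.
inline float runtime_self_subtract(float x) {
    volatile float v = x;
    float a = v;
    float b = v;
    return a - b;
}

// Prints whether (x - x) is NaN and what its sign bit is.
inline void report(const char* label, float x) {
    assert(label != nullptr);
    const float r = runtime_self_subtract(x);
    std::printf("%-5s - itself: isnan=%d signbit=%d\n",
                label, std::isnan(r) ? 1 : 0, std::signbit(r) ? 1 : 0);
}
```

Calling report("inf", std::numeric_limits<float>::infinity()) on x86 would be expected to show signbit=1, consistent with the 0x?C nibble seen at -O0, but the sign of a freshly generated NaN should not be assumed to carry over to other architectures.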
Bottom line, my question is: is this "expected behavior" according to some rules I am not aware of, or did I find a bug in three different compilers (GCC, Clang, and ICX)? (I couldn't test MSVC, as Godbolt doesn't support executing the code for those builds.)
(-fno-strict-aliasing doesn't affect the results, so ((int32_t*)(char*)mask_field)[i] wasn't causing this.)
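For comparison, a detection that does not depend on the sign of any NaN can be sketched by testing the exponent bits directly: a float is +/- infinity or NaN exactly when all 8 exponent bits are set. The scalar version below is a minimal sketch rather than a drop-in replacement for the code above; is_nonfinite_bits and nonfinite_mask are hypothetical names. In vector form the same test would presumably be an integer AND and compare per 32-bit lane (with AVX2, _mm256_and_si256 and _mm256_cmpeq_epi32) followed by _mm256_movemask_ps on the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: true if x is +/-inf or any NaN.
// A float is non-finite exactly when all 8 exponent bits are set.
inline bool is_nonfinite_bits(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);  // well-defined way to read the bit pattern
    return (u & 0x7f800000u) == 0x7f800000u;
}

// Builds a movemask-style bitmask: bit i is set when lanes[i] is inf or NaN.
inline unsigned nonfinite_mask(const float* lanes, std::size_t n) {
    assert(n <= 32);  // the result has to fit in one 32-bit mask
    unsigned mask = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (is_nonfinite_bits(lanes[i])) {
            mask |= 1u << i;
        }
    }
    return mask;
}
```

For the eight inputs used in the question (0.0, 1.0, inf, -inf, qnan, -qnan, snan, -snan) this yields 0xfc regardless of which sign any NaN carries, because the sign bit is masked out before the comparison.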