On my Intel x86_64 machine, this C++ code generates different sequences on Clang vs GCC:
#include <iostream>
namespace {
template<typename Out>
constexpr auto caster{[](auto x) constexpr {
    return static_cast<Out>(x);
}};
} // namespace
auto main() -> int {
    constexpr auto fl{caster<double>};
    constexpr double ellipse_b_start{1.0};
    constexpr double ellipse_b_end{150.0};
    constexpr long ellipse_b_count{12347};
    constexpr double ellipse_b_step{(ellipse_b_end - ellipse_b_start) /
                                    fl(ellipse_b_count)};
    std::ios::sync_with_stdio(false);
    std::cout << std::hexfloat;
    for (long i{0}; i < ellipse_b_count; i++) {
        auto ellipse_b{ellipse_b_start + ellipse_b_step * fl(i)};
        std::cout << ellipse_b << '\n';
    }
}
Addition and multiplication are well-defined by IEEE 754, so I expected the sequence to be fully determined and therefore identical on both compilers.
Traditionally, the Intel x87 extended-precision floating-point registers would be blamed for this. But this is a modern Intel x86_64 CPU, so presumably AVX or SSE is used for floating-point arithmetic instead of x87?
My questions
- What is the reason for the different behavior between GCC and Clang?
- How can I get the exact same sequence of numbers on both compilers? The sequence should also be fast to generate.
- Is this a manifestation of a bug in Clang?
- Is this a manifestation of a bug in GCC?
-ffp-contract=off
Eric Postpischil proposed this compiler option as a solution. While it may well fix this particular problem, it is awkward for my complete code (the above is just an example), because the option applies to the entire compilation unit, which is undesirable for performance and other reasons.
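To make the concern about scope concrete, here is a minimal sketch of restricting the setting to one function instead of the whole compilation unit. It rests on assumptions that should be checked against the actual toolchain: that Clang honors "#pragma clang fp contract(off)" at the start of a function body, and that GCC accepts the optimize("fp-contract=off") function attribute. The names NO_FP_CONTRACT and term_unfused are hypothetical and not part of my real code.

```cpp
#include <cassert>

// Hypothetical helper macro: choose a per-function way to forbid contraction.
// Clang also defines __GNUC__, so it has to be tested first.
#if defined(__clang__)
#  define NO_FP_CONTRACT            // Clang: handled by the pragma below
#elif defined(__GNUC__)
#  define NO_FP_CONTRACT __attribute__((optimize("fp-contract=off")))
#else
#  define NO_FP_CONTRACT            // other compilers: no per-function control assumed
#endif

// One term of the sequence, intended to be computed as a rounded multiply
// followed by a rounded add, never as a single fused multiply-add.
NO_FP_CONTRACT
double term_unfused(double start, double step, long i) {
#if defined(__clang__)
#pragma clang fp contract(off)      // must come first in the compound statement
#endif
    assert(i >= 0);
    return start + step * static_cast<double>(i);
}
```

GCC's documentation describes the optimize attribute as a debugging aid rather than something for production code, so it seems worth inspecting the generated assembly (vmulsd followed by vaddsd, rather than vfmadd132sd) before relying on this.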
Additional information
GCC is version 11.1.0.
Clang is version 12.0.1.
Both GCC and Clang compile my code with these options:
-std=c++20 -pedantic -g -march=native -flto -O3 -fno-exceptions
The CPU is i5-8300H.
I can also provide the binaries if someone wants to take a look.
Context
The motivation for the code was comparing several different implementations of an analytical function, where the sequence in question provides inputs on which the different implementations are to be compared. This is why I want the sequences to be predictable even across compilers. I basically want to be able to consider the sequence of inputs as fixed/written in stone.
Examples of differing parts of the sequence
GCC:
...
0x1.59973622ca91bp+0
0x1.5cae14b13b7c3p+0
0x1.5fc4f33fac66cp+0
0x1.62dbd1ce1d515p+0
0x1.65f2b05c8e3bdp+0
...
Clang:
...
0x1.59973622ca91bp+0
0x1.5cae14b13b7c4p+0
0x1.5fc4f33fac66cp+0
0x1.62dbd1ce1d515p+0
0x1.65f2b05c8e3bep+0
...
Clang's sequence and GCC's sequence do tend to resynchronize; there are never many differing points in a row.
Ghidra decompilation for Clang
int main(void)
{
  undefined auVar1 [16];
  basic_ostream *pbVar2;
  long lVar3;
  long in_FS_OFFSET;
  undefined in_XMM1 [16];
  char local_21;
  long local_20;
  local_20 = *(long *)(in_FS_OFFSET + 0x28);
  lVar3 = 0;
  std::ios_base::sync_with_stdio(false);
  *(uint *)(_ITM_deregisterTMCloneTable + *(long *)(std::cout + -0x18)) =
       *(uint *)(_ITM_deregisterTMCloneTable + *(long *)(std::cout + -0x18)) | 0x104;
  do {
    auVar1 = vcvtsi2sd_avx(in_XMM1,lVar3);
    auVar1 = vmulsd_avx(auVar1,ZEXT816(0x3f88b6f473875453));
    auVar1 = vaddsd_avx(auVar1,ZEXT816(0x3ff0000000000000));
    pbVar2 = std::basic_ostream<char,std::char_traits<char>>::_M_insert_double_
                       (SUB168(auVar1,0));
    local_21 = '\n';
    std::__ostream_insert_char_std__char_traits_char__(pbVar2,&local_21,1);
    lVar3 = lVar3 + 1;
  } while (lVar3 != 0x303b);
  if (*(long *)(in_FS_OFFSET + 0x28) == local_20) {
    return 0;
  }
  /* WARNING: Subroutine does not return */
  __stack_chk_fail();
}
Ghidra decompilation for GCC
undefined8 main(void)
{
  undefined auVar1 [16];
  basic_ostream *pbVar2;
  long lVar3;
  long in_FS_OFFSET;
  undefined in_YMM1 [32];
  char local_21;
  long local_20;
  lVar3 = 0;
  local_20 = *(long *)(in_FS_OFFSET + 0x28);
  std::ios_base::sync_with_stdio(false);
  *(uint *)(_ITM_deregisterTMCloneTable + *(long *)(std::cout + -0x18)) =
       *(uint *)(_ITM_deregisterTMCloneTable + *(long *)(std::cout + -0x18)) | 0x104;
  do {
    auVar1 = vxorpd_avx(SUB3216(in_YMM1,0),SUB3216(in_YMM1,0));
    in_YMM1 = ZEXT1632(auVar1);
    auVar1 = vcvtsi2sd_avx(auVar1,lVar3);
    lVar3 = lVar3 + 1;
    auVar1 = vfmadd132sd_fma(auVar1,ZEXT816(0x3ff0000000000000),
                             ZEXT816(0x3f88b6f473875453));
    pbVar2 = std::basic_ostream<char,std::char_traits<char>>::_M_insert_double_
                       (SUB168(auVar1,0));
    local_21 = '\n';
    std::__ostream_insert_char_std__char_traits_char__(pbVar2,&local_21,1);
  } while (lVar3 != 0x303b);
  if (local_20 == *(long *)(in_FS_OFFSET + 0x28)) {
    return 0;
  }
  /* WARNING: Subroutine does not return */
  __stack_chk_fail();
}
Notice how GCC does a fused multiply-add operation, while Clang doesn't. I guess that could be the reason for the differences? But is there a nice way to prevent the differences in the sequence's terms?
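To check whether a fused multiply-add alone can change a result, here is a small sketch with hand-picked inputs (not values from my sequence). It assumes that storing the product in a volatile variable forces it to be rounded to double before the addition, and that std::fma rounds only once. The function names are made up for the illustration.

```cpp
#include <cassert>
#include <cmath>

// Two roundings: the product is rounded to double (the volatile store forces
// it to be materialized), then the sum is rounded a second time.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// One rounding: std::fma evaluates a * b + c exactly and rounds the result once.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Hand-picked inputs: a * a is exactly 1 + 2^-26 + 2^-54, and the 2^-54 tail
// is lost when the product is rounded to double on its own.
void check_example() {
    const double a{1.0 + std::ldexp(1.0, -27)};
    const double c{-(1.0 + std::ldexp(1.0, -26))};
    assert(mul_then_add(a, a, c) == 0.0);            // tail already rounded away
    assert(fused(a, a, c) == std::ldexp(1.0, -54));  // tail survives the single rounding
}
```

Both results are valid IEEE 754 outcomes for the operations actually performed; they differ only in whether the intermediate product was rounded, which would be consistent with the one-bit differences in the last hex digit above.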
I previously said that I would accept an inline assembly solution, but now that I think about it, I actually want a cross-platform solution. If there is no better way, I'll just try using -ffp-contract=off.
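One cross-platform direction I can think of goes the other way: instead of forbidding the fused operation, request it explicitly with std::fma, so that every compiler performs a single rounding whatever its contraction default. This is only a sketch under that assumption; term_fused and make_inputs are hypothetical names, and on targets without a hardware FMA instruction std::fma may fall back to a much slower software routine.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One term of the sequence with exactly one rounding: start + step * i.
double term_fused(double start, double step, long i) {
    return std::fma(step, static_cast<double>(i), start);
}

// Builds the whole input sequence once, so every implementation under test
// can be fed the same values.
std::vector<double> make_inputs(double start, double end, long count) {
    assert(count > 0);
    const double step{(end - start) / static_cast<double>(count)};
    std::vector<double> inputs;
    inputs.reserve(static_cast<std::size_t>(count));
    for (long i{0}; i < count; ++i) {
        inputs.push_back(term_fused(start, step, i));
    }
    return inputs;
}
```

With the values from my example this would be called as make_inputs(1.0, 150.0, 12347). If the fused results are what ends up being "written in stone", the sequence should then match what GCC currently prints rather than what Clang prints.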