
On my Intel x86_64 machine, this C++ code generates different sequences on Clang vs GCC:

#include <iostream>

namespace {

template<typename Out>
constexpr auto caster{[](auto x) constexpr {
        return static_cast<Out>(x);
}};

}  // namespace

auto main() -> int {
        constexpr auto fl{caster<double>};

        constexpr double ellipse_b_start{1.0};
        constexpr double ellipse_b_end{150.0};
        constexpr long ellipse_b_count{12347};

        constexpr double ellipse_b_step{(ellipse_b_end - ellipse_b_start) /
                                        fl(ellipse_b_count)};

        std::ios::sync_with_stdio(false);
        std::cout << std::hexfloat;

        for (long i{0}; i < ellipse_b_count; i++) {
                auto ellipse_b{ellipse_b_start + ellipse_b_step * fl(i)};

                std::cout << ellipse_b << '\n';
        }
}

Addition and multiplication are both correctly rounded under IEEE 754, so I expected the sequence to be a mathematical constant, identical regardless of compiler.

Traditionally the Intel x87 extended-precision floating-point registers would be blamed for this. But this is a modern Intel x86_64 CPU, so presumably the compilers use SSE/AVX scalar instructions for floating point instead of x87?

My questions

  1. What is the reason for the different behavior between GCC and Clang?
  2. How can I get the exact same sequence of numbers on both compilers? The sequence should also be fast to generate.
  3. Is this a manifestation of a bug in Clang?
  4. Is this a manifestation of a bug in GCC?

-ffp-contract=off

Eric Postpischil proposed this compiler option as a solution. While it may well fix this particular problem, it is problematic when applied to my complete code (the above is just an example), because the option applies to the entire translation unit, which is undesirable for performance and other reasons.

Additional information

The GCC version is 11.1.0.

The Clang version is 12.0.1.

Both GCC and Clang compile my code with these options:

-std=c++20 -pedantic -g -march=native -flto -O3 -fno-exceptions

The CPU is i5-8300H.

I can also provide the binaries if someone wants to take a look.

Context

The motivation for the code was comparing several different implementations of an analytical function, where the sequence in question provides inputs on which the different implementations are to be compared. This is why I want the sequences to be predictable even across compilers. I basically want to be able to consider the sequence of inputs as fixed/written in stone.

Examples of differing parts of the sequence

GCC:

...
0x1.59973622ca91bp+0
0x1.5cae14b13b7c3p+0
0x1.5fc4f33fac66cp+0
0x1.62dbd1ce1d515p+0
0x1.65f2b05c8e3bdp+0
...

Clang:

...
0x1.59973622ca91bp+0
0x1.5cae14b13b7c4p+0
0x1.5fc4f33fac66cp+0
0x1.62dbd1ce1d515p+0
0x1.65f2b05c8e3bep+0
...

Clang's sequence and GCC's sequence do tend to resynchronize; there are never many differing terms in a row.

Ghidra decompilation for Clang

int main(void)

{
        undefined auVar1 [16];
        basic_ostream *pbVar2;
        long lVar3;
        long in_FS_OFFSET;
        undefined in_XMM1 [16];
        char local_21;
        long local_20;
        
        local_20 = *(long *)(in_FS_OFFSET + 0x28);
        lVar3 = 0;
        std::ios_base::sync_with_stdio(false);
        *(uint *)(_ITM_deregisterTMCloneTable + *(long *)(std::cout + -0x18)) =
             *(uint *)(_ITM_deregisterTMCloneTable + *(long *)(std::cout + -0x18)) | 0x104;
        do {
                auVar1 = vcvtsi2sd_avx(in_XMM1,lVar3);
                auVar1 = vmulsd_avx(auVar1,ZEXT816(0x3f88b6f473875453));
                auVar1 = vaddsd_avx(auVar1,ZEXT816(0x3ff0000000000000));
                pbVar2 = std::basic_ostream<char,std::char_traits<char>>::_M_insert_double_
                                   (SUB168(auVar1,0));
                local_21 = '\n';
                std::__ostream_insert_char_std__char_traits_char__(pbVar2,&local_21,1);
                lVar3 = lVar3 + 1;
        } while (lVar3 != 0x303b);
        if (*(long *)(in_FS_OFFSET + 0x28) == local_20) {
                return 0;
        }
                    /* WARNING: Subroutine does not return */
        __stack_chk_fail();
}

Ghidra decompilation for GCC

undefined8 main(void)

{
        undefined auVar1 [16];
        basic_ostream *pbVar2;
        long lVar3;
        long in_FS_OFFSET;
        undefined in_YMM1 [32];
        char local_21;
        long local_20;
        
        lVar3 = 0;
        local_20 = *(long *)(in_FS_OFFSET + 0x28);
        std::ios_base::sync_with_stdio(false);
        *(uint *)(_ITM_deregisterTMCloneTable + *(long *)(std::cout + -0x18)) =
             *(uint *)(_ITM_deregisterTMCloneTable + *(long *)(std::cout + -0x18)) | 0x104;
        do {
                auVar1 = vxorpd_avx(SUB3216(in_YMM1,0),SUB3216(in_YMM1,0));
                in_YMM1 = ZEXT1632(auVar1);
                auVar1 = vcvtsi2sd_avx(auVar1,lVar3);
                lVar3 = lVar3 + 1;
                auVar1 = vfmadd132sd_fma(auVar1,ZEXT816(0x3ff0000000000000),
                                         ZEXT816(0x3f88b6f473875453));
                pbVar2 = std::basic_ostream<char,std::char_traits<char>>::_M_insert_double_
                                   (SUB168(auVar1,0));
                local_21 = '\n';
                std::__ostream_insert_char_std__char_traits_char__(pbVar2,&local_21,1);
        } while (lVar3 != 0x303b);
        if (local_20 == *(long *)(in_FS_OFFSET + 0x28)) {
                return 0;
        }
                    /* WARNING: Subroutine does not return */
        __stack_chk_fail();
}

Notice how GCC emits a fused multiply-add instruction (vfmadd132sd), while Clang uses a separate multiply and add. That is presumably the reason for the differences: the FMA rounds only once, after the combined multiply-add, while the separate instructions round twice, so individual terms can differ in the last bit. But is there a nice way to prevent the differences in the sequence's terms?

I previously said that I would accept an inline-assembly solution, but on reflection I actually want a cross-platform solution. If there is no better way, I'll just try using -ffp-contract=off.

user2373145
    Please show a [mre] within the question without relying on external links – Alan Birtles Jul 24 '21 at 06:42
  • 1
    *How can I get the exact same sequence of numbers on both compilers?* -- Honestly, did you expect exact results performing floating point calculations from two different compilers made by different vendors, and add to that the various compiler options and optimizations that can occur? That's wishful thinking. Also, it isn't a good thing to mention "bug" in a compiler, when there is no proof of one. That's always been a sore point to the ones responsible for creating the compiler. – PaulMcKenzie Jul 24 '21 at 06:55
  • 1
    As to your question, [what you are encountering is not new, it is quite old and well-known](https://stackoverflow.com/questions/588004/is-floating-point-math-broken). Heck, entire periodicals and chapters in books have been written on this. – PaulMcKenzie Jul 24 '21 at 07:11
  • 1
    Your repo has an external dependency which makes the problem difficult for others to reproduce. – n. m. could be an AI Jul 24 '21 at 07:57
  • 2
@PaulMcKenzie Your link shows that floating point math has unexpected behaviour, but this question is about the consistency of that behaviour. – gerum Jul 24 '21 at 08:29
  • 1
    Maybe looking at the generated assembler may give more insight? – Galik Jul 24 '21 at 08:37
  • 2
@PaulMcKenzie: The question you link to and its answers do not say floating-point arithmetic is not reproducible. They say floating-point arithmetic approximates real arithmetic and is typically binary based, so its rounding errors produce results different from what one gets with the decimal-based arithmetic humans are used to from early schooling. That has nothing to do with this question. – Eric Postpischil Jul 24 '21 at 11:22
  • Your program uses PARI. First, you should figure out if the irreproducibility arises in your software or PARI. Examine all inputs to PARI functions to see if your code provides different values to PARI in the different C implementations or if PARI is returning different values given the same inputs. If the irreproducibility arises in PARI, there might not be much you can do about it. A brief check shows no mention of reproducibility in the PARI documentation, so the developers might not have controlled for it… – Eric Postpischil Jul 24 '21 at 11:32
  • … I note that PARI can be built with GMP or its own arbitrary-precision code. So I suggest building it with GMP instead of its own code (or vice-versa), as possibly the irreproducibility lies in that code and replacing it with GMP would give reproducible results. – Eric Postpischil Jul 24 '21 at 11:34
  • I don't understand the justification for closing this question. I received a private clarification that says this "Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.". That's some generic wording that's not relevant to the question. – user2373145 Jul 24 '21 at 12:40
  • I mean, OK, a more minimal reproducer *would* be nice, I can see that, but is that why the question was closed? The reproducer isn't that complicated. I'm not sure I can reduce it... – user2373145 Jul 24 '21 at 12:41
  • 2
    You should edit your question to include a [mcve]. That is, directly compilable *actual code* which demonstrates the problem. – Sneftel Jul 24 '21 at 12:42
  • @Sneftel OK, I'll definitely try to create a more minimal reproducer, but is that really grounds for closing this question? – user2373145 Jul 24 '21 at 12:46
  • 3
    @user2373145: Yes, absolutely. Stack Overflow is not a personal debugging service. Participants are not expected to fetch your code from some third-party site, download the library (PARI) it uses, configure the library (with unknown options, since you did not state them), build and install that library, build your code, and debug it for you. Stack Overflow is intended to be a durable repository of specific questions and answers to serve readers in the future. External links are not durable: The data at them changes or vanishes over time… – Eric Postpischil Jul 24 '21 at 12:57
  • … Ideally, a Stack Overflow question should be a succinct statement of a clear problem whose answer is educational to other people. – Eric Postpischil Jul 24 '21 at 12:58
  • Yes. Once the question is edited, it’s automatically nominated for reopening. With that said, you already edited the question without fixing the problems with the original question, so it might have to be manually nominated for reopening. – Sneftel Jul 24 '21 at 12:58
  • @EricPostpischil I understand that, I just think that someone may be able to answer the question without doing all that if it weren't closed. – user2373145 Jul 24 '21 at 12:58
  • 3
    @user2373145: I did download your code, and I downloaded PARI, and I configured it, and I tried to build it and got a syntax error from bison. That’s a reason Stack Overflow questions should be self-contained with **minimal** reproducible examples: Readers should not have to resolve version and compatibility issues. They have nothing to do with the underlying problem. The burden of eliminating these and reducing the problem to something minimal is yours. – Eric Postpischil Jul 24 '21 at 13:00
Re “I just think that someone may be able to answer the question…”: Answering **your** questions is not a goal of Stack Overflow. Answering a question tied specifically to your source code, which is not present in the question, and that uses third-party libraries, does not serve the purpose of Stack Overflow. How can **somebody else** learn from your question? That calls for the presented question to illustrate some issue useful to other people, so they need to see the code with the issue **in the question**. – Eric Postpischil Jul 24 '21 at 13:05
  • @EricPostpischil I mean, sorry about that, but I didn't ask or expect anyone to build my code. My point is I don't see why there couldn't hypothetically be a good answer without building the code. – user2373145 Jul 24 '21 at 13:09
  • If it isn't clear I'll try to make a more minimal reproducer. – user2373145 Jul 24 '21 at 13:14
  • 1
    @user2373145: If you want a hypothetical answer, see clause 11, “Reproducible floating-point results,” in the 2008 or 2019 IEEE-754 floating-point standard. Some of the most common causes of non-reproducibility include using different implementations of math library routines, using different precisions during calculations (permitted by the C and C++ standards), using value-changing optimizations (including fused multiply-add contractions for separate multiplication and addition), and parallelizing computations with inadequate controls for reproducibility. – Eric Postpischil Jul 24 '21 at 13:27
  • In any case, as a first step, or at least a preliminary step before asking for any assistance on Stack Overflow, you should isolate the problem to your code or PARI. In the absence of parallelism, this is simple: Instrument each call to a PARI routine so that its inputs are printed to a log file with full precision and its outputs are printed to a log file with full precision. Then run the program with GCC and with Clang. See whether the first difference occurs in the inputs (your code generated different results) or outputs (PARI did)… – Eric Postpischil Jul 24 '21 at 13:29
  • … In the former case, identify the section of your code that generated the different results, strip out the rest of the program, and make that a reproducible example. In the latter case, PARI is at issue, and you have logged data showing what was passed, so you can isolate it to a single PARI call. – Eric Postpischil Jul 24 '21 at 13:31
  • Sorry for the noise, this turns out to be trivial to reproduce. – user2373145 Jul 24 '21 at 13:36
  • @Sneftel can I nominate this for reopening somehow now? – user2373145 Jul 24 '21 at 13:38
  • I’ve nominated it. – Sneftel Jul 24 '21 at 13:54
  • 4
    Re “GCC does a fused multiply-add operation”: For this specific issue, compile with `-ffp-contract=off`. – Eric Postpischil Jul 24 '21 at 14:07
  • 1
    If you want FMA instructions for both compilers, you could replace `ellipse_b_start + ellipse_b_step * fl(i)` by `std::fma(ellipse_b_step, fl(i), ellipse_b_start)` (this would be quite inefficient when compiled for a CPU without FMA, though). And of course, if you have a large code-base this would be quite difficult to do, everywhere. And you also need to make sure that you link to the same math-libraries. – chtz Jul 24 '21 at 23:56
  • user2373145 "What is the reason for the different behavior between GCC and Clang?" Please post here a sample of the 2 different outputs. – chux - Reinstate Monica Jul 27 '21 at 12:01
  • user2373145, "What is the reason for the different behavior between GCC and Clang?" --> From a C perspective, which I think C++ inherits here, code like `ellipse_b_start + ellipse_b_step * fl(i)` can be done use `double` math (All objects are `double`) or `long double` math (even with all `double` objects), depending on of the `int` macro `FLT_EVAL_METHOD` resulting in a different value. To force a `double` intermediate, try `double product = ellipse_b_step * fl(i); ellipse_b_start + product;`. – chux - Reinstate Monica Jul 27 '21 at 12:01
  • @chux-ReinstateMonica TBH, you obviously didn't read the question. The sample is already there, and in fact the question is basically answered already (partly by me in the question edits, partly by Eric and chtz in their comments here). – user2373145 Jul 28 '21 at 13:17

0 Answers