
The following simple C code exhibits different behavior depending on the toolchain used on macOS 11.6.1:

#include "assert.h"
#include "stdio.h"
int main()
{
    double y[2] = {-0.01,0.9};
    double r;
    r = y[0]+0.03*y[1];
    printf("r = %24.26e\n",r);
    assert(r == 0.017);
}

The result with the default toolchain is:

$ clang -v 
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin20.6.0
Thread model: posix

$ clang -arch arm64 test.c -o testxcode; ./testxcode
r = 1.70000000000000012212453271e-02

while the result with conda 23.1.0 and the cxx-compiler package (versions given below) is:

$ conda list | grep cxx 
cxx-compiler              1.5.2                hffc8910_0    conda-forge
libcxx                    16.0.3               h4653b0c_0    conda-forge
$ clang -v
clang version 14.0.6
Target: arm64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Users/mottelet/mambaforge/envs/scilab_build/bin

$ clang test.c -o testconda; ./testconda             
r = 1.69999999999999977517983751e-02
Assertion failed: (r == 0.017), function main, file test.c, line 9.
zsh: abort      ./testconda
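
The two results differ only in the last couple of bits, but that is enough to make the exact `==` comparison fail. As an aside, a tolerance-based version of the check passes under either toolchain; this is only an illustrative sketch, and the few-ulps bound is my own choice, not something taken from the failing test:

#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
int main()
{
    double y[2] = {-0.01,0.9};
    double r = y[0]+0.03*y[1];
    printf("r = %24.26e\n",r);
    /* allow a few ulps of slack instead of demanding bit-exact equality */
    assert(fabs(r - 0.017) <= 4.0*DBL_EPSILON*0.017);
}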

To analyse this, I compiled and disassembled the simpler code

int main()
{
    double y[2] = {-0.01,0.9};
    double r = y[0]+0.03*y[1];
}

asm code with the default macOS/Xcode toolchain (asmxcode file in the diff below):

_main:
    sub sp, sp, #0x40
    stp x29, x30, [sp, #0x30]
    add x29, sp, #0x30
    adrp    x8, 1 ; 0x100004000
    ldr x8, [x8] ; literal pool symbol address: ___stack_chk_guard
    ldr x8, [x8]
    stur    x8, [x29, #-0x8]
    adrp    x8, 0 ; 0x100003000
    add x8, x8, #0xfa0
    ldr q0, [x8]
    str q0, [sp, #0x10]
    ldr d0, [sp, #0x10]
    ldr d2, [sp, #0x18]
    adrp    x8, 0 ; 0x100003000
    ldr d1, [x8, #0xf90]
    fmul    d1, d1, d2
    fadd    d0, d0, d1
    str d0, [sp, #0x8]
    adrp    x8, 1 ; 0x100004000
    ldr x8, [x8] ; literal pool symbol address: ___stack_chk_guard
    ldr x8, [x8]
    ldur    x9, [x29, #-0x8]
    subs    x8, x8, x9
    b.ne    0x100003f5c
    mov w0, #0x0
    ldp x29, x30, [sp, #0x30]
    add sp, sp, #0x40
    ret
    bl  0x100003f60 ; symbol stub for: ___stack_chk_fail

asm code with the conda toolchain (asmconda file in the diff below):

_main:
    sub sp, sp, #0x40
    stp x29, x30, [sp, #0x30]
    add x29, sp, #0x30
    adrp    x8, 1 ; 0x100004000
    ldr x8, [x8] ; literal pool symbol address: ___stack_chk_guard
    ldr x8, [x8]
    stur    x8, [x29, #-0x8]
    adrp    x8, 0 ; 0x100003000
    add x8, x8, #0xfa0
    ldr q0, [x8]
    str q0, [sp, #0x10]
    ldr d2, [sp, #0x10]
    ldr d1, [sp, #0x18]
    adrp    x8, 0 ; 0x100003000
    ldr d0, [x8, #0xf90]
    fmadd   d0, d0, d1, d2
    str d0, [sp, #0x8]
    ldur    x9, [x29, #-0x8]
    adrp    x8, 1 ; 0x100004000
    ldr x8, [x8] ; literal pool symbol address: ___stack_chk_guard
    ldr x8, [x8]
    subs    x8, x8, x9
    b.eq    0x100003f50
    b   0x100003f4c
    bl  0x100003f60 ; symbol stub for: ___stack_chk_fail
    mov w0, #0x0
    ldp x29, x30, [sp, #0x30]
    add sp, sp, #0x40
    ret

Here is the diff:

mottelet@portmottelet-cr-1 unit_tests % diff -Naur asmconda asmxcode
--- asmconda    2023-06-05 19:58:14.000000000 +0200
+++ asmxcode    2023-06-05 19:58:20.000000000 +0200
@@ -10,21 +10,21 @@
    add x8, x8, #0xfa0
    ldr q0, [x8]
    str q0, [sp, #0x10]
-   ldr d2, [sp, #0x10]
-   ldr d1, [sp, #0x18]
+   ldr d0, [sp, #0x10]
+   ldr d2, [sp, #0x18]
    adrp    x8, 0 ; 0x100003000
-   ldr d0, [x8, #0xf90]
-   fmadd   d0, d0, d1, d2
+   ldr d1, [x8, #0xf90]
+   fmul    d1, d1, d2
+   fadd    d0, d0, d1
    str d0, [sp, #0x8]
-   ldur    x9, [x29, #-0x8]
    adrp    x8, 1 ; 0x100004000
    ldr x8, [x8] ; literal pool symbol address: ___stack_chk_guard
    ldr x8, [x8]
+   ldur    x9, [x29, #-0x8]
    subs    x8, x8, x9
-   b.eq    0x100003f50
-   b   0x100003f4c
-   bl  0x100003f60 ; symbol stub for: ___stack_chk_fail
+   b.ne    0x100003f5c
    mov w0, #0x0
    ldp x29, x30, [sp, #0x30]
    add sp, sp, #0x40
    ret
+   bl  0x100003f60 ; symbol stub for: ___stack_chk_fail
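
The only functional change in the diff is that the conda clang contracts the multiply and add into a single fmadd (one rounding of the exact product-plus-sum), while the Xcode clang emits separate fmul and fadd (two roundings). The sketch below should reproduce both printed values on the same machine; it assumes fma() from <math.h>, and the volatile store is only there to keep the compiler from contracting the "separate" version, which is not something from the original test:

#include <math.h>
#include <stdio.h>
int main()
{
    double y[2] = {-0.01,0.9};
    volatile double p = 0.03*y[1];        /* product rounded to double first */
    double separate = y[0] + p;           /* second rounding, like fmul + fadd */
    double fused = fma(0.03, y[1], y[0]); /* single rounding, like fmadd */
    printf("separate = %24.26e\n", separate);
    printf("fused    = %24.26e\n", fused);
}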

Question:

How can I obtain the asm code of the default toolchain with the conda toolchain (which compilation flags should I add or remove)?
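
A possible direction, untested here: clang documents an -ffp-contract=off|on|fast option and also honours the standard FP_CONTRACT pragma, so contraction can be disabled on the command line or in the source itself. I have not checked whether either makes the conda clang reproduce the Xcode asm byte for byte, so the sketch below is a starting point rather than a confirmed fix:

#include <assert.h>
#include <stdio.h>
/* ask the compiler not to fuse a*b+c into a single fmadd in this file */
#pragma STDC FP_CONTRACT OFF
int main()
{
    double y[2] = {-0.01,0.9};
    double r = y[0]+0.03*y[1];
    printf("r = %24.26e\n",r);
    assert(r == 0.017);
}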

Stéphane Mottelet
  • The asm output is incomplete. It looks like the values used in the computation are loaded from memory (probably in a literal pool or .rodata section), but those values are not shown in the asm you posted. They'll probably be .long directives with funny decimal or hex values, or something like that. If you compile with optimization, the code will be much shorter and easier to follow, and I expect you'll still see the discrepancy. – Nate Eldredge Jun 05 '23 at 18:23
  • [This answer](https://stackoverflow.com/a/71968617/4142924) states *"The C standard does not fully specify either the `float` format or how rounding is performed"* and the same applies to `double`. I'm not sure that 2 compilers can be forced to give identical code, since the "as-if" rule allows them to produce different code. There isn't any "perfectly compliant" code output, since the C standard does not specify it: only how the code should behave. – Weather Vane Jun 05 '23 at 18:26
  • `==` and float/double is almost never a good idea. Something like `fabs(a - b) < epsilon` is the comparison you should normally be using. – Dave S Jun 05 '23 at 18:28
  • Your two results are the same to 16 decimal digits. This is at the limit (slightly beyond, actually) of the decimal precision of IEEE binary-64 floating-point representation. The two results differ by 1, maybe 2 ulp. That's pretty good correspondence. You might get that much difference or more with the same toolchain by changing compilation options. – John Bollinger Jun 05 '23 at 18:37
  • Obtaining perfectly reproducible floating point results is generally problematic. A number of factors can cause minute differences. They can depend on the compiler, the compiler version, the optimization flags, and of course the architecture. For example, some architectures have internal floating point registers that are larger than 64 bits. When used for intermediate computations, this can affect the result. And when optimizing, some floating point operations may be performed at compile time, again with potentially different results. – Tom Karzes Jun 05 '23 at 18:41
  • For what it's worth, the primary difference between the two appears to be that the Conda version is using an `fmadd` whereas the XCode version is using separate `fmul` and `fadd`. This seems to reflect a difference between code-generation strategies in the two versions of Clang involved. But for production, you really should be compiling with optimization enabled, probably at `-O2` or above. That will probably change the results of both compilers, and it might even make them generate the same machine code as each other. Not that that's something you should rely on in any case. – John Bollinger Jun 05 '23 at 18:44
  • [You have already asked this question](https://stackoverflow.com/questions/76407947/trivial-c-program-yields-different-result-in-clang-macos-arm64-and-clang-macos-x) and it looks like the answer there is perfectly adequate. – n. m. could be an AI Jun 05 '23 at 19:47
  • @n.m. In fact the question is really different. Here the question is about different toolchains (likely different clang versions) and the same CPU (here arm64), whereas the other question is about the same toolchain (clang 14) and different CPUs (arm64 vs x86_64). There were so many comments on the other topic that bringing in the latter case would have completely muddled the discussion. – Stéphane Mottelet Jun 05 '23 at 20:36
  • @StéphaneMottelet Try `#include .... printf("%d\n", FLT_EVAL_METHOD);` and report the output on the 2 systems. – chux - Reinstate Monica Jun 06 '23 at 00:05
  • I don't see a substantial difference between the two, and the answer is the same. You cannot and should not expect the same exact behaviour from different floating point implementations, or even from the same implementation on different days. If you need this, you have a problem with your design. – n. m. could be an AI Jun 06 '23 at 05:03
  • No, the question is the same, and the answer is the same. Which is the answer I gave on your original version of this question. `fmadd` gives one result (combined multiply and add in one instruction), and `fmul` + `fadd` gives the other result. The compiler can choose either to implement a multiply and add in a C expression, on either x86_64 or arm64. FMA is available on both. – Mark Adler Jun 06 '23 at 05:43
