
I had some problems while porting some complex code to macOS/arm64 and ended up with the following trivial program, which exhibits behavior different from macOS/x86_64 (using the native osx-arm64 clang version 14.0.6 from conda-forge, and cross-compiling for x86_64):

#include <assert.h>
#include <stdio.h>
int main()
{
    double y[2] = {-0.01,0.9};
    double r;
    r = y[0]+0.03*y[1];
    printf("r = %24.26e\n",r);
    assert(r == 0.017);
}

The result on arm64 is

$ clang -arch arm64 test.c -o test; ./test
Assertion failed: (r == 0.017), function main, file test.c, line 9.
r = 1.69999999999999977517983751e-02
zsh: abort      ./test

while the result on x86_64 is

$ clang -arch x86_64 test.c -o test; ./test
r = 1.70000000000000012212453271e-02
$       

The test program has also been compiled and run on an actual x86_64 machine; it yields the same result as above (i.e., the same as the binary cross-compiled on arm64 and run under Rosetta).

In fact, what matters is not that the arm64 result is not bitwise equal to 0.017 parsed and stored as an IEEE 754 number, but that the value of the expression differs from the x86_64 one.

Update 1:

To check for possibly different conventions (e.g. the rounding mode), the following program has been compiled and run on both platforms:

#include <iostream>
#include <limits>

#define LOG(x) std::cout << #x " = " << x << '\n'

int main()
{
    using l = std::numeric_limits<double>;
    LOG(l::digits);
    LOG(l::round_style);
    LOG(l::epsilon());
    LOG(l::min());

    return 0;
}

It yields the same output on both:

l::digits = 53
l::round_style = 1
l::epsilon() = 2.22045e-16
l::min() = 2.22507e-308

hence the problem seems to be elsewhere.

Update 2:

If it can help: under arm64 the result obtained with the expression is the same as the one obtained by calling refBLAS ddot with vectors {1,0.03} and y.

Update 3:

The toolchain seems to be the cause. Using the default toolchain of macOS 11.6.1:

mottelet@portmottelet-cr-1 ~ % clang -v
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

gives the same result for both architectures! So the problem seems to be in the actual toolchain I am using: version 1.5.2 of the conda package cxx-compiler (I need conda as a package manager because the application I am building has a lot of dependencies that conda provides).

Using -v shows a bunch of compilation flags; which one could be the culprit?

Stéphane Mottelet
  • Run this on both platforms: https://godbolt.org/z/jn9zGjWj4 – Marek R Jun 05 '23 at 15:45
  • @Marek R it yields exactly (and hopefully) the same figures: l::digits = 53 l::round_style = 1 l::epsilon() = 2.22045e-16 l::min() = 2.22507e-308 – Stéphane Mottelet Jun 05 '23 at 15:52
  • So apparently the problem is more complex than we expected. You could print the hex representation of the values at each step to spot the moment when there is a difference. I do not have an M1 to test this myself. – Marek R Jun 05 '23 at 15:54
  • If it can help: under arm64 the result obtained with the expression is the same as the one obtained by calling refBLAS `ddot` with vectors `{1,0.03}` and `y`. – Stéphane Mottelet Jun 05 '23 at 15:57
  • Could you post the generated assembly of both versions? – Nate Eldredge Jun 05 '23 at 16:00
  • BTW, I really recommend updating your question with any extra details provided in comments, so the next SO reader does not have to go through them. – Marek R Jun 05 '23 at 16:01
  • Different rounding modes could also be in effect. I don't recall off the top of my head how to query the rounding mode, but you might try to find out. By the way, is the x86 version being run on actual x86 hardware, or on arm64 via Rosetta emulation? – Nate Eldredge Jun 05 '23 at 16:06
  • @Nate Eldredge : I tried both and the result is the same (x86_64 cross compiled code gives the same result on arm64/Rosetta as native x86_64 compiled version on a x86_64 machine, run on the same machine). – Stéphane Mottelet Jun 05 '23 at 16:13
  • I think you may also have to add your toolchain version. I'm getting the `1.70000000000000012212453271e-02` result on arm64, with anything from `-O0` to `-O3`. – Siguza Jun 05 '23 at 16:17
  • Maybe OT, but shouldn't you get `r = 1.699999....` output first and then `Assertion failed: ...` ?? – Jabberwocky Jun 05 '23 at 16:24
  • @Siguza you are right, both results agree when using the default macOS toolchain (I updated the question with this information). But the question remains open: what has to be changed in the toolchain I use to get the same results? – Stéphane Mottelet Jun 05 '23 at 16:51
  • Thanks everybody for your insights I will open a new question focused on the toolchain problem in conda. – Stéphane Mottelet Jun 05 '23 at 17:28
  • You mentioned that using -v shows a bunch of compilation flags. You should add that list of flags to the question. There are flags like `-ffast-math` and `-ffp-contract` that affect floating point accuracy. – user3386109 Jun 05 '23 at 21:49
  • I forced the computation to actually occur at run time, and there really is a difference in the result on the two architectures, due to the use of the `fmadd` (combined multiply and add) instruction on the arm64, vs. separate multiply and add on the x86_64. – Mark Adler Jun 05 '23 at 22:00

2 Answers


The results differ in the least significant bit due to different rounding by the two compiler/architecture combinations. You can use %a to see all of the bits of the double in hex. Then you get on arm64:

0x1.16872b020c49bp-6

and on x86_64:

0x1.16872b020c49cp-6

The IEEE 754 standard by itself does not guarantee exactly the same results across conforming implementations, in particular due to destination accuracy, decimal conversions, and instruction choices. Variations in the least significant bit, or more with multiple operations, can and should be expected.

In this case, the fmadd operation on the arm64 architecture is used, doing the multiply and add in a single operation. That gives a different result than the separate multiply and add XMM operations used in the x86_64 architecture.

In the comments, Eric points out the C library function fma() to do a combined multiply-add. Indeed, if I use that call on the x86_64 architecture (as well as on arm64), I get the arm64 fmadd result.

You could also get different behavior on the same architecture if the compiler optimizes the operation away, as it may in this example. Then the compiler does the computation at compile time, and it could very well use separate multiply and add operations there, giving a different result on arm64 than the fmadd instruction would when the computation is not optimized out. Also, if you are cross-compiling, a calculation folded at compile time could depend on the architecture of the machine you are compiling on, as opposed to the one you are running on.

Comparison for exact equality of floating point values is fraught with peril. Whenever you see yourself attempting that, you need to think more deeply about your intent.

Mark Adler
  • Thanks, but as I have precised above, the results are the same on x86_64 and arm64 when using default macOS toolchain (which I cannot use). – Stéphane Mottelet Jun 05 '23 at 16:47
  • The IEEE 754 standard does guarantee exactly the same results for these operations across conforming implementations. The C standard does not guarantee the IEEE 754 standard is conformed to, nor do common compilers guarantee conformance to the C standard in all modes. If IEEE 754 were conformed to, the results here would be identical. – Eric Postpischil Jun 05 '23 at 20:37
  • @EricPostpischil No, not necessarily. IEEE 754 does not assure as much as you seem to think it does. See https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#3098 . – Mark Adler Jun 05 '23 at 20:43
  • @MarkAdler: The operations involved here are multiplication and addition. IEEE 754-2019 specifies exactly which result is produced for those operations as a function of the rounding method in clause 4.3. Any two compilers conforming to IEEE 754 and using the same binding to IEEE 754 (e.g., those specified in Annex F of the C standard, including mapping `double` to the binary64 format and using round-to-nearest-ties-to-even for the default rounding method) must produce the same results for multiplication and addition. – Eric Postpischil Jun 05 '23 at 21:14
  • @EricPostpischil You are assuming that both machines are doing a separate multiply and add. They are not. The arm64 machine is executing an `fmadd` instruction, which does a combined multiply and add, which can have different rounded results. – Mark Adler Jun 05 '23 at 21:38
  • @MarkAdler: And that is not conforming to IEEE 754. `y[0]+0.03*y[1];` is a multiply and an add. Fused multiply-add is a separate operation in IEEE 754, called fusedMultiplyAdd. There is even a standard C library routine for it, `fma`. – Eric Postpischil Jun 05 '23 at 21:44
  • @EricPostpischil You're claiming that the compiler is somehow forbidden from using a fused multiply-add for that C code? – Mark Adler Jun 05 '23 at 21:45
  • @MarkAdler: No. As I wrote, C compilers do not conform to the C standard in all modes, and, even if they did, the C standard does not require conformance to IEEE 754. I wrote that **if they were conforming** to IEEE 754, the result would be determined. The issue is not that IEEE 754 does not uniquely determine the result (it does, in contradiction to what you say in the answer); the issue is the compilers are not conforming to it. – Eric Postpischil Jun 05 '23 at 23:15
  • What I said is "The IEEE 754 standard _by itself_ does not guarantee exactly the same results across conforming implementations". (Added emphasis.) I did not say that individual IEEE 754 operations are not deterministic. However that determinism, _by itself_ does not assure the same result for the same C code. You would need additional restrictions not part of either the IEEE 754 standard or the C standard. This is a perfect example of that, where the compiler is free to use different IEEE 754 operations to implement the same C code. There is nothing in the C standard that says it can't. – Mark Adler Jun 05 '23 at 23:33
  • So, no, even if the C compiler and machine were fully conforming to the IEEE 754 standard for individual operations, the result from a given C code is _not_ necessarily determined. As shown in this example. (There are no compiler bugs, nor IEEE 754 bugs here.) – Mark Adler Jun 05 '23 at 23:38

It appears that clang behavior changed between 13.x and 14.x. When using -O, the result is computed at compile time and the target's floating point has nothing to do with it, so this is strictly a compiler issue.

Try on godbolt

The difference is easier to see in hex float output. clang 13 and earlier compute the value 0x1.16872b020c49cp-6, which is slightly greater than 0.017; clang 14 and later compute 0x1.16872b020c49bp-6, which is slightly less (different by 1 in the least significant bit).

The same discrepancy exists between the two versions whether on arm64 or x86-64.

I am not sure offhand which one is better or worse. I guess you could git bisect if you really care, and look at the rationale for the corresponding commit and see whether it seems to be correct. For comparison, gcc in all versions tested gives the "old clang" value of 0x1.16872b020c49cp-6.

Nate Eldredge
  • Does the assembly code I gave in this other topic confirm this: https://stackoverflow.com/questions/76409008/simple-c-program-yields-different-results-on-macos-arm64-depending-on-toolchain ? – Stéphane Mottelet Jun 05 '23 at 18:15
  • The asm output you posted there is incomplete. It looks like the values used in the computation are loaded from memory (probably in a literal pool or .rodata section). but those values are not shown in the asm you posted. They'll probably be `.long` directives with funny decimal or hex values, or something like that. If you compile with optimization, the code will be much shorter and easier to follow, and I expect you'll still see the discrepancy. – Nate Eldredge Jun 05 '23 at 18:21
  • OK I will fix this. But concerning your above analysis, how do you explain the discrepancy between arm64 and x86_64 since the same toolchain is used (conda clang 14.0.6) for both ? – Stéphane Mottelet Jun 05 '23 at 18:42
  • @StéphaneMottelet: I guess that's a separate question. From your original post it wasn't clear to me that you were using the same compiler version on both architectures. – Nate Eldredge Jun 06 '23 at 08:49
  • yes, M1 Mac, same compiler (clang 14.0.6), just adding `-arch x86_64` to cross-compile the Intel binary. But the cause seems the same: an fmadd (a combined 3-operand multiplication/addition) in the arm64 asm code but a sequence of mulsd and addsd in the x86_64 asm code. – Stéphane Mottelet Jun 06 '23 at 09:11