
I would like to know whether any code in C or C++ using floating-point arithmetic would produce bit-exact results on any x86-based architecture, regardless of the complexity of the code.

To my knowledge, every x86 architecture since the Intel 8087 has had an FPU designed to handle IEEE-754 floating-point numbers, and I cannot see any reason why the result would be different on different architectures. However, if the results were different (say, due to a different compiler or a different optimization level), would there be some way to produce bit-exact results just by configuring the compiler?

Samuel Navarro Lou
  • Use SSE/SSE2 instead; it'll be more consistent, and it'll run fine on any CPU from the last decade. – phuclv Nov 26 '14 at 13:50
  • Hmm, pretty hard to find a code generator these days that still emits FPU instructions. The big 3 all took the breaking change. The usual advice applies, keep your tool chain updated and use a mersenne twister if you need a good source of random digits. – Hans Passant Nov 26 '14 at 16:25
  • 3
    x86 assembly code will produce bit-exact results (ignoring FDIV bug), expecting C to guarantee anything reasonable is like expecting politicians to do the right thing. – harold Nov 26 '14 at 17:00
  • @harold : well that's the reason why there is a standard (IEEE754) and why most compilers conform to it, unless you ask them to break the standard for more optimization (like -ffast-math in gcc) – galinette Dec 02 '14 at 14:34
  • 1
    @galinette sure, compilers can conform. Or not. For example ICC does -ffast-math by default, and SDCC targeting TI-83+ uses a kind of floats that aren't even in the list defined by IEEE754. – harold Dec 02 '14 at 14:44
  • 1
    @HansPassant MT is actually an outdated, slow, and not even Crush-resistent RNG. It's most widely used does not make it the best. In today's parallel computing world it is increasingly inadequate for the job, for its speed, size and properties. There are many better choices out there such as counter-based RNGs – Yan Zhou Dec 06 '16 at 18:10

2 Answers


Table of contents:

  • C/C++
  • asm
  • Creating real-life software that achieves this.

In C or C++:

No, a fully ISO C11 and IEEE-conforming C implementation does not guarantee bit-identical results to other C implementations, even other implementations on the same hardware.

(And first of all, I'm going to assume we're talking about normal C implementations where double is the IEEE-754 binary64 format, etc., even though it would be legal for a C implementation on x86 to use some other format for double and implement FP math with software emulation, and define the limits in float.h. That might have been plausible back when not every x86 CPU included an FPU, but in 2016 that's Deathstation 9000 territory.)


related: Bruce Dawson's Floating-Point Determinism blog post is an answer to this question. His opening paragraph is amusing (and is followed by a lot of interesting stuff):

Is IEEE floating-point math deterministic? Will you always get the same results from the same inputs? The answer is an unequivocal “yes”. Unfortunately the answer is also an unequivocal “no”. I’m afraid you will need to clarify your question.

If you're pondering this question, then you will definitely want to have a look at the index to Bruce's series of articles about floating point math, as implemented by C compilers on x86, and also asm, and IEEE FP in general.


First problem: Only the "basic operations": + - * / and sqrt are required to return "correctly rounded" results, i.e. <= 0.5 ulp of error, correctly rounded out to the last bit of the mantissa, so the result is the closest representable value to the exact result.

Other math library functions like pow(), log(), and sin() allow implementers to make a tradeoff between speed and accuracy. For example, glibc generally favours accuracy, and is slower than Apple's OS X math libraries for some functions, IIRC. See also glibc's documentation of the error bounds for every libm function across different architectures.
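For example, one way to check what your particular libm gives you is to print the exact bit pattern of a result: the basic operations must match everywhere, but the last bit or two of `sin()` may legitimately differ between glibc, musl, MSVCRT, and friends. A minimal sketch in plain C99, assuming nothing beyond a binary64 double:

```c
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    double s = sin(0.5);              // accuracy depends on your libm
    uint64_t bits;
    memcpy(&bits, &s, sizeof bits);   // inspect the bits without aliasing UB
    printf("sin(0.5) = %.17g = 0x%016" PRIx64 "\n", s, bits);
    return 0;
}
```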


But wait, it gets worse. Even code that only uses the correctly-rounded basic operations doesn't guarantee the same results.

C rules also allow some flexibility in keeping higher precision temporaries. The implementation defines FLT_EVAL_METHOD so code can detect how it works, but you don't get a choice if you don't like what the implementation does. You do get a choice (with #pragma STDC FP_CONTRACT off) to forbid the compiler from e.g. turning a*b + c into an FMA with no rounding of the a*b temporary before the add.
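Here's a tiny demo of what contraction changes. Caveat: gcc has historically ignored this pragma (its command-line equivalent is `-ffp-contract=off`), while clang honours it; the hex-float output makes the difference visible in the low bits:

```c
#include <stdio.h>

// Forbid fusing x*x - 1.0 into a single FMA with no intermediate rounding.
#pragma STDC FP_CONTRACT OFF

int main(void) {
    double x = 1.0 + 0x1p-30;
    // Without contraction, x*x rounds to 1 + 0x1p-29 first, so this prints
    // 0x1p-29. A fused FMA keeps the low bits and gives 0x1.00000002p-29.
    printf("%a\n", x * x - 1.0);
    return 0;
}
```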

On x86, compilers targeting 32-bit non-SSE code (i.e. using obsolete x87 instructions) typically keep FP temporaries in x87 registers between operations. This produces the FLT_EVAL_METHOD = 2 behaviour of 80-bit precision. (The standard specifies that rounding still happens on every assignment, but real compilers like gcc don't actually do extra store/reloads for rounding unless you use -ffloat-store. See https://gcc.gnu.org/wiki/FloatingPointMath. That part of the standard seems to have been written assuming non-optimizing compilers, or hardware that efficiently provides rounding to the type width like non-x86, or like x87 with precision set to round to 64-bit double instead of 80-bit long double. Storing after every statement is exactly what gcc -O0 and most other compilers do, and the standard allows extra precision within evaluation of one expression.)

So when targeting x87, the compiler is allowed to evaluate the sum of three floats with two x87 FADD instructions, without rounding off the sum of the first two to a 32-bit float. In that case, the temporary has 80-bit precision... Or does it? Not always, because the C implementation's startup code (or a Direct3D library!!!) may have changed the precision setting in the x87 control word, so values in x87 registers are rounded to a 53- or 24-bit mantissa. (This makes FDIV and FSQRT run a bit faster.) All of this is from Bruce Dawson's article about intermediate FP precision.
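A minimal sketch of the kind of surprise this allows; whether it prints 1 or 0 depends on your compiler, optimization level, and whether the temporary happens to get spilled (try gcc -m32 -mfpmath=387 at different -O levels, with and without -ffloat-store):

```c
#include <stdio.h>

int main(void) {
    volatile double a = 1.0, b = 3.0;  // volatile defeats constant-folding
    double q = a / b;
    // With x87 (FLT_EVAL_METHOD == 2), the recomputed a / b can be an
    // 80-bit temporary while q was rounded to 64 bits when stored, so
    // this comparison is not guaranteed to be true.
    printf("%d\n", q == a / b);
    return 0;
}
```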


In assembly:

With rounding mode and precision set the same, I think every x86 CPU should give bit-identical results for the same inputs, even for complex x87 instructions like FSIN.
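You can watch the rounding mode change the result of the very same divide with standard C99 `<fenv.h>` (hedge: gcc doesn't fully honour the `FENV_ACCESS` pragma, so you may also need `-frounding-math`, and on glibc you link with `-lm`):

```c
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON   // we modify the FP environment at run time

int main(void) {
    volatile double a = 1.0, b = 3.0;
    fesetround(FE_UPWARD);
    double up = a / b;         // 1/3 rounded toward +infinity
    fesetround(FE_DOWNWARD);
    double down = a / b;       // 1/3 rounded toward -infinity
    fesetround(FE_TONEAREST);
    printf("up   = %.17g\ndown = %.17g\n", up, down);  // differ in the last bit
    return 0;
}
```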

Intel's manuals don't define exactly what those results are for every case, but I think Intel aims for bit-exact backwards compatibility. I doubt they'll ever add extended-precision range-reduction for FSIN, for example. It uses the 80-bit pi constant you get with fldpi (correctly-rounded 64-bit mantissa, actually 66-bit because the next 2 bits of the exact value are zero). Intel's documentation of the worst-case-error was off by a factor of 1.3 quintillion until they updated it after Bruce Dawson noticed how bad the worst-case actually was. But this can only be fixed with extended-precision range reduction, so it wouldn't be cheap in hardware.
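If you want to see this for yourself, here's a GCC-style inline-asm sketch that runs the actual FSIN instruction next to your libm's sin (x87-specific; the "t" constraint means the top of the x87 register stack). How far the two results diverge depends on how much precision the 66-bit internal pi loses for your input:

```c
#include <math.h>
#include <stdio.h>

static double x87_sin(double x) {
    double r;
    __asm__ ("fsin" : "=t"(r) : "0"(x));  // st(0) in, st(0) out
    return r;
}

int main(void) {
    double x = 3.0e9;  // large input: FSIN's truncated-pi range reduction hurts
    printf("fsin:     %.17g\n", x87_sin(x));
    printf("libm sin: %.17g\n", sin(x));  // glibc does extended range reduction
    return 0;
}
```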

I don't know if AMD implements their FSIN and other micro-coded instructions to always give bit-identical results to Intel, but I wouldn't be surprised. Some software does rely on it, I think.


Since SSE only provides instructions for add/sub/mul/div/sqrt, there's nothing too interesting to say. They implement the IEEE operation exactly, so there's no chance that any x86 implementation will ever give you anything different (unless the rounding mode is set differently, or denormals-are-zero and/or flush-to-zero are different and you have any denormals).

SSE rsqrt (fast approximate reciprocal square root) is not exactly specified, and I think it's possible you might get a different result even after a Newton iteration, but other than that SSE/SSE2 is always bit-exact in asm, assuming the MXCSR isn't set weird. So the only question is getting the compiler to generate the same code, or just using the same binaries.
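For example (SSE1 intrinsics; the exact bits of the approximation, and therefore of anything refined from it, can differ between CPU vendors and even models):

```c
#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    float x = 2.0f;
    // RSQRTSS: ~12-bit approximation; its exact bits are not specified.
    float approx = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    // One Newton-Raphson step roughly doubles the precision, but it only
    // refines whatever the hardware handed you, so it's still not
    // guaranteed bit-identical across CPUs.
    float refined = approx * (1.5f - 0.5f * x * approx * approx);
    printf("approx:  %.9g\nrefined: %.9g\nsqrt:    %.9g\n",
           approx, refined, 1.0f / sqrtf(x));
    return 0;
}
```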


In real life:

So, if you statically link a libm that uses SSE/SSE2 and distribute those binaries, they will run the same everywhere. Unless that library uses run-time CPU detection to choose alternate implementations...

As @Yan Zhou points out, you pretty much need to control every bit of the implementation down to the asm to get bit-exact results.

However, some games really do depend on this for multi-player, but often with detection/correction for clients that get out of sync. Instead of sending the entire game state over the network every frame, every client computes what happens next. If the game engine is carefully implemented to be deterministic, they stay in sync.

In the Spring RTS, clients checksum their gamestate to detect desync. I haven't played it for a while, but I do remember reading something at least 5 years ago about them trying to achieve sync by making sure all their x86 builds used SSE math, even the 32-bit builds.
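A sketch of the kind of bit-level gamestate checksum such engines use (the FNV-1a hash and the `hash_doubles` name are my illustrative choices here, not Spring's actual code). The key point is to hash the bit patterns, not the values: 0.0 and -0.0 compare equal but have different bits, and NaNs don't compare equal at all:

```c
#include <stdint.h>
#include <string.h>

static uint64_t hash_doubles(const double *v, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL;        // FNV-1a offset basis
    for (size_t i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &v[i], sizeof bits);     // raw bits, no aliasing UB
        for (int b = 0; b < 8; b++) {
            h ^= (bits >> (8 * b)) & 0xff;
            h *= 0x100000001b3ULL;             // FNV-1a prime
        }
    }
    return h;
}
```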

One possible reason for some games not allowing multi-player between PC and non-x86 console systems is that the engine gives the same results on all PCs, but different results on the different-architecture console with a different compiler.

Further reading: GAFFER ON GAMES: Floating Point Determinism. Some techniques that real game engines use to get deterministic results. e.g. wrap sin/cos/tan in non-optimized function calls to force the compiler to leave them at single-precision.
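For instance, a minimal sketch of that last trick (GCC/Clang spelling; MSVC would use `__declspec(noinline)`, and `det_sinf` is a made-up name):

```c
#include <math.h>

// Route the call through a non-inlinable wrapper so the compiler can't
// constant-fold it, contract it, or evaluate it at a different precision
// at different call sites.
__attribute__((noinline)) float det_sinf(float x) {
    return sinf(x);
}
```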

Peter Cordes
  • 1
    The last point is not exactly true. A `libm` might make CPU dispatch and reordering some evaluations within math functions. This will affect results. One example is intel `libimf` and another is AMD libm. – Yan Zhou Dec 06 '16 at 18:36
  • 1
    This answer is very informative, as usual. But floating point is actually more tricky when external libraries are invlvied. My experience is that unless you control every bit of implementation, you cannot expect bit-exact results across different CPU. – Yan Zhou Dec 06 '16 at 18:38
  • @YanZhou: Great point about libraries. Added an "in real life" section with more about the complications. – Peter Cordes Dec 06 '16 at 23:11
  • http://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.html#Errors-in-Math-Functions – Z boson Dec 08 '16 at 09:33
  • In terms of assembly what about e.g. signed zero? Is there freedom in hardware implementation here? E.g. `_mm_min_ps(-0.0,0.0)=0.0` and `_mm_max_ps(-0.0,0.0)=-0.0` last time I checked with Intel hardware. Does it have to be this way? `fmin` does not require `-0<0` it only says that this is the preferred software implementation (and GCC and Clang both have `-0<0` for `fmin`). – Z boson Dec 08 '16 at 10:12
  • 1
    @Zboson: I don't think there's any freedom for signed zero. There's none for MINPS. `_mm_min_ps(a,b)` is a strict IEEE implementation of `a < b ? a : b`. But note that [gcc has only just fixed a long-standing bug where it considers the intrinsics associative](http://stackoverflow.com/questions/40196817/what-is-the-instruction-that-gives-branchless-fp-min-and-max-on-x86). It will use MINSS correctly, but even without `-ffast-math`, `_mm_or_ps(_mm_min_ps(a,b), _mm_min_ps(b,a))` miscompiles by CSEing the two mins with opposite arg order, on gcc before 7.0. – Peter Cordes Dec 08 '16 at 14:38
  • "gcc has only just fixed a long-standing bug where it considers the intrinsics associative" did you mean commutative instead of associative? – Z boson Dec 09 '16 at 08:16
  • @Zboson: yes, thanks. Those words are easy to mix up, regardless of knowing exactly which property you're thinking of. – Peter Cordes Dec 09 '16 at 14:09

If the compiler and architecture are compliant with the IEEE standards, yes.

For instance, gcc is IEEE compliant if configured properly. If you use the -ffast-math flag, it will not be IEEE compliant.
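(In practice that means avoiding `-ffast-math` and `-funsafe-math-optimizations`, and pinning down code generation with something like `-msse2 -mfpmath=sse -ffp-contract=off`; MSVC's closest equivalent is `/fp:strict`. But see the other answer for why even this alone doesn't guarantee bit-exact results across implementations.)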

See http://www.validlab.com/goldberg/paper.pdf page 25.

If you want to know exactly what you can rely on for a given IEEE 754-1985 hardware/compiler pair, you need to purchase the standard from the IEEE site. Unfortunately, it is not publicly available.


galinette
  • Does this mean that if I compile the code on a Windows platform using MSVC and also on a Linux platform using GCC, with different processors (but still x86 architecture), I would obtain the exact same result? – Samuel Navarro Lou Nov 26 '14 at 13:16
  • If the compilers are IEEE compliant, and in IEEE compliant mode, yes. Read the compiler documentation for finding the proper flags, but I believe they are IEEE by default, flags are there for disabling IEEE compliance to gain more speed. – galinette Nov 26 '14 at 13:20
  • No you cannot expect that. Different compilers will produce different code, use different optimisations and so on. – David Heffernan Nov 26 '14 at 13:28
  • MSVC has the `/fp` switch to control the behaviour of floating-point operations. [Have a look.](http://msdn.microsoft.com/en-us/library/e7s85ffb.aspx) – Daniel Kamil Kozar Nov 26 '14 at 13:31
  • @David Heffernan : no, IEEE compliance modes imply no optimizations that change the result even slightly. Look at the -funsafe-math-optimizations options in https://gcc.gnu.org/onlinedocs/gcc-4.0.2/gcc/Optimize-Options.html – galinette Nov 26 '14 at 18:24
  • I don't believe that IEEE754 implies that – David Heffernan Nov 26 '14 at 18:28
  • 1
    But then you have intermediate values, which may be stored to higher precision. Does the standard mandate that is not allowed? And of course there are operations not defined in IEEE754. What if you do trig for example? – David Heffernan Nov 26 '14 at 19:50
  • Different platforms typically have different implementations of math library functions like `pow()`, `log()`, and `sin()`, with different speed/accuracy tradeoffs. IEEE does not require operations other than + - / * and sqrt to return results with <= 0.5ulp of error (i.e. correctly-rounded out to the last bit of the mantissa). – Peter Cordes Nov 20 '16 at 02:12
  • C rules also allow some flexibility in keeping higher precision temporaries, and implementations define http://en.cppreference.com/w/c/types/limits/FLT_EVAL_METHOD appropriately. – Peter Cordes Nov 20 '16 at 02:14