4

I am working on a project with a lot of math calculations. After switching to a new test machine, I noticed that many tests failed. It is also important to note that the tests fail on my development machine and on some other developers' machines as well. After tracing values and comparing them with the values from the old machine, I found that some functions from math.h (so far I have only found cosine) sometimes return slightly different values (for example: 40965.8966304650828827e-01 vs 40965.8966304650828816e-01, and -3.3088623618085204e-08 vs -3.3088623618085197e-08).
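A minimal illustration of the kind of check where the difference shows up (a sketch only; the input value here is made up, not taken from the real test suite):

#include <math.h>
#include <stdio.h>

int main( void )
{
    /* Illustrative input; the real tests compare many computed values. */
    double x = 0.5 ;
    double c = cos( x ) ;

    /* 17 significant digits are enough to round-trip a double; this is
       where the machines start to disagree in the last digit or two.   */
    printf( "%.17g\n", c ) ;
    return 0 ;
}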

New CPU: Intel Xeon Gold 6230R (Intel64 Family 6 Model 85 Stepping 7)

Old CPU: Exact model is unknown (Intel64 Family 6 Model 42 Stepping 7)

My CPU: Intel Core i7-4790K

Test results don't depend on the Windows version (7 and 10 were tested).

I have tried testing with a binary that was statically linked against the standard library, to rule out different library versions being loaded for different processes and Windows versions, but all results were the same.

The project is compiled with /fp:precise; switching to /fp:strict changed nothing.

MSVC from Visual Studio 2015 is used: 19.00.24215.1 for x64.

How to make calculations fully reproducible?

armoken
  • what is the question? – Özgür Güzeldereli Oct 14 '22 at 20:02
  • 1
    Are `-ffast-math` options involved? – Ted Lyngmo Oct 14 '22 at 20:03
  • 2
    You are beyond the precision of a double in the first example. Remember that a double has 15 to 17 digits of precision. – drescherjm Oct 14 '22 at 20:05
  • *I have noticed that a lot of tests failed.* -- Well, it isn't a surprise if you're testing the entire floating point value, all the way down to the least significant digit, that you will get differences. Don't know what you were expecting by running these types of tests. – PaulMcKenzie Oct 14 '22 at 20:06
  • Sidenote: C versions and C++ versions follow different IEEE 754 standards. Neither C nor C++ follow IEEE 754 to the letter (if I've understood it correctly) - but - if you just change compiler, you can also expect different behavior in this regard. – Ted Lyngmo Oct 14 '22 at 20:06
  • Yes, different machines, C standard libraries, compilers, and compilation options, among other things, may produce differences in floating-point computations. Not only library functions, but arithmetic, too. Often such differences are small, but C and C++ do not place a formal limit, and accumulation of error over multiple computations can result in large errors. – John Bollinger Oct 14 '22 at 20:06
  • 11
    _"How to make calculations fully reproducible?"_ - Don't rely on _exact_ results when it comes to floating point math. – Ted Lyngmo Oct 14 '22 at 20:08
  • 6
    I suspect you should update your test cases to relax their acceptance criteria *slightly*. Not a lot — you still want to catch real errors — but you don't want to call something a failure when it is, as here, an acceptable and unavoidable variation in the last one or two bits of a floating-point result. I can't tell you exactly how to do this, because it can be a hard problem, with some real subtleties. You might want to retain a consultant with expertise in floating point numerical analysis. – Steve Summit Oct 14 '22 at 20:11
  • 2
    All your tests did was show the obvious in terms of how floating point works. If you tested maybe 4 or 5 digits of precision, ok. But all 15 / 17 digits? That is bound to fail, if not guaranteed to fail. – PaulMcKenzie Oct 14 '22 at 20:16
Is `40965.8966304650828827e-01` an argument you give to `cos`? You should not expect full precision for arguments that large. And also, as others have said already, you can expect an error of maybe 1-2 ULP even for good `cos` implementations and "small" arguments. Do not expect bitwise identical results on different machines/with different library versions. – chtz Oct 14 '22 at 20:24
  • 7
    Floating-point results *can* be exact, but often they're not, and usually it's not appropriate to expect them to be. If you were doing Quality Control in a widget manufacturing plant, and if the widgets were supposed to be 17.5 inches long, you would probably check to see that they were 17.5 ±0.01 inches long, or maybe ±0.001 inches, or maybe ±0.0001 inches. But you would *not* insist that they be 17.5±0.00000000001 inches. And for a great many programs that compute floating-point results, the same principle applies. – Steve Summit Oct 14 '22 at 20:26
  • QC Person: "I had to throw out a thousand Widgets". Boss: "Why?". QC Person: "Because the length was off by one Angstrom unit". Boss: "You're fired". – PaulMcKenzie Oct 14 '22 at 20:38
Thanks! I understand that floating-point calculations can't be exact, but I thought that calculation results on CPUs of the same architecture from the same company should be the same. I will think about choosing a suitable tolerance or switching to long double. – armoken Oct 14 '22 at 20:39
  • 2
    @АлександрВащилко Might be that your math library (that contains the implementation of the cosine function) has different versions optimized for different CPU's. So by changing the CPU a different version is chosen which happens to, in this particular case, produce results that are ever so slightly different. – janneb Oct 14 '22 at 20:47
@janneb, I mentioned that I already tried to link with the standard library statically, but it changed nothing – armoken Oct 14 '22 at 20:56
  • @armoken Not sure how cpu dispatch is typically done on Windows, but I don't see why it couldn't be done in a static library. – janneb Oct 14 '22 at 21:04
  • 1
    *or switching to long double* isn't going to fix the problem. – Weather Vane Oct 14 '22 at 21:16
  • Does this answer your question? [Why do sin(45) and cos(45) give different results?](https://stackoverflow.com/questions/31509019/why-do-sin45-and-cos45-give-different-results) – phuclv Oct 15 '22 at 08:54
  • don't expect bit-level exactness, especially with transcendental functions because they all depend on the quality of the implementation library. Duplicates: [Does any floating point-intensive code produce bit-exact results in any x86-based architecture?](https://stackoverflow.com/q/27149894/995714) [Slight acos precision difference between Clang and Visual C++](https://stackoverflow.com/q/73202732/995714), [Math precision requirements of C and C++ standard](https://stackoverflow.com/q/20945815/995714) – phuclv Oct 15 '22 at 08:58
  • more duplicates: [Floating point accuracy with different languages](https://stackoverflow.com/q/58411805/995714), [Is C floating-point non-deterministic?](https://stackoverflow.com/q/24339868/995714), [How can floating point calculations be made deterministic?](https://stackoverflow.com/q/7365790/995714), [Why do sin(45) and cos(45) give different results?](https://stackoverflow.com/q/31509019/995714), [How to keep float/double arithmetic deterministic?](https://stackoverflow.com/q/46796126/995714) – phuclv Oct 15 '22 at 09:00

3 Answers

4

Since you are on Windows, I am pretty sure the different results are because the UCRT detects at runtime whether FMA3 (fused multiply-add) instructions are available on the CPU and, if so, uses them in transcendental functions such as cosine. This gives slightly different results. The solution is to place the call set_FMA3_enable(0); at the very start of your main() or WinMain() function, as described here.
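A minimal sketch of where the call goes (assuming x64 MSVC, where the UCRT declares the switch in <math.h>; note the spelling there is usually `_set_FMA3_enable`):

#include <math.h>

int main( void )
{
    /* Tell the UCRT to use the non-FMA3 code paths for cos(), sin(), etc.,
       so results match machines whose CPUs lack FMA3. Call this before any
       transcendental function is evaluated.                               */
    _set_FMA3_enable( 0 ) ;

    double c = cos( 0.5 ) ;   /* now computed by the same code path on all CPUs */
    /* ... rest of the program ... */
    return 0 ;
}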

If you want to have reproducibility also between different operating systems, things become harder or even impossible. See e.g. this blog post.

In response also to the comments stating that you should just use some tolerance, I do not agree with this as a general statement. Certainly, there are many applications where this is the way to go. But I do think that it can be a sensible requirement to get exactly the same floating-point results for some applications, at least when staying on the same OS (Windows, in this case). In fact, we had the very same issue with set_FMA3_enable a while ago. I am a software developer for a traffic simulation, and minor differences such as 10^-16 often build up and eventually lead to entirely different simulation results. Naturally, one is supposed to run many simulations with different seeds and average over all of them, making the different behavior irrelevant for the final result. But: sometimes customers have a problem at a specific simulation second for a specific seed (e.g. an application crash or incorrect behavior of an entity), and not being able to reproduce it on our developer machines due to a different CPU makes it much harder to diagnose and fix the issue.

Moreover, if the test system consists of a mixture of older and newer CPUs and test cases are not bound to specific machines, tests can sometimes deviate seemingly without reason (flaky tests). This is certainly not desired. Requiring exact reproducibility also makes writing the tests much easier, because you do not need heuristic thresholds (e.g. a tolerance or some guessed number of samples). Moreover, our customers expect the results to remain stable for a specific version of the program, since they have calibrated (more or less...) their traffic networks to real data. This is somewhat questionable, since (again) one should actually look at averages, but the naive expectation usually wins in reality.

Sedenion
  • 1
    As one of the commenters who was advocating for some tolerance, I appreciate your comments about striving for exactitude. In the end I think it's a tradeoff: there can be costs to inexactitude, such as the non-reproducibility of simulations that you mentioned, but then again, there are costs to tracking down and then finding a way to correct every inexactitude, and those can be high, too! So sometimes you have to choose your poison. (Btw, in the context of traffic simulations, on first reading I completely misinterpreted those words "a problem — e.g. a crash". :-) ) – Steve Summit Oct 15 '22 at 13:26
Heh, you're right ;-) I amended the text to make it clearer. Also I agree with it being a trade-off. However, in practice we did not have much trouble over the years requiring reproducibility when staying on a single platform (OS + tool chain). AFAIK it was mostly the `set_FMA3_enable` thingy a few years ago and a change by Microsoft to the behavior of `printf` 1 or 2 years ago. – Sedenion Oct 15 '22 at 14:13
  • This is a plausible explanation of the FP difference. It's not clear whether you knew or were just guessing, but Intel64 Family 6 Model 42 Stepping 7 dates from before Intel CPUs supported FMA3, whereas the other two postdate the addition of FMA3 to Intel designs. – John Bollinger Oct 15 '22 at 14:31
  • As far as the question of testing exact FP results, I can appreciate that doing so might be desirable in conjunction with the particular application you describe, but the constraints on that application are atypical, and indeed, not entirely sensible (as you acknowledge yourself). I would accept that there are cases where you do want to test exact FP results, but this answer seems to suggest that it is *always* reasonable and appropriate to test exact FP results, and I utterly reject that. – John Bollinger Oct 15 '22 at 14:39
  • @JohnBollinger I completely agree with you that exact reproducibility is not always (or even rarely) a sensible requirement, and I am sorry that my post sounded like this (although I did mention "for some applications"). I edited it to emphasize that my statement holds only for certain applications. On the other hand, I do reject the notion of the comments suggesting that it can *never* be a sensible requirement. It really depends on the application, and without knowing the situation of the OP, we cannot make a fair judgment. – Sedenion Oct 15 '22 at 16:23
The values in the question have more than the 15 significant digits of precision supported by double-precision IEEE-754 FP. It is probably unreasonable and even meaningless to expect reproducibility beyond that. – Clifford Oct 15 '22 at 18:27
0

IEEE-754 double precision binary floating point provides only about 15 significant decimal digits of precision. You are looking at the "noise" of different library implementations and possibly different FPU implementations.

How to make calculations fully reproducible?

That is an X-Y problem. The answer is: you can't. But it is the wrong question. You would do better to ask how you can implement valid and robust tests that are sympathetic to this well-known and unavoidable technical issue with floating-point representation. Since you have not provided the test code you are using, it is not possible to answer that directly.

Generally you should avoid comparing floating point values for exact equality, and rather subtract the result from the desired value, and test for some acceptable discrepancy within the supported precision of the FP type used. For example:

#include <math.h>       /* fabs */
#include <stdbool.h>    /* bool (built in when compiled as C++) */

#define EXPECTED_RESULT  40965.8966304650
#define RESULT_PRECISION 00000.0000000001

extern double test( void ) ;   /* function under test */

double actual_result = test() ;
bool error = fabs( actual_result - EXPECTED_RESULT ) > RESULT_PRECISION ;
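If the expected values span very different magnitudes, a single absolute threshold like the one above is hard to choose; a relative tolerance is one common alternative (again only a sketch, not a drop-in for any particular test framework):

#include <math.h>
#include <stdbool.h>

/* True if a and b agree to within rel_tol of the larger magnitude. */
static bool nearly_equal( double a, double b, double rel_tol )
{
    double diff  = fabs( a - b ) ;
    double scale = fmax( fabs( a ), fabs( b ) ) ;
    return diff <= rel_tol * scale ;
}

/* e.g. tolerate differences in roughly the last couple of significant digits: */
/* bool ok = nearly_equal( actual_result, EXPECTED_RESULT, 1e-14 ) ;           */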
Clifford
0
  • First of all, 40965.8966304650828827e-01 cannot be a result of the cos() function: for real-valued arguments, cos(x) always returns a value in the interval [-1.0, 1.0], so the value shown cannot be its output.

  • Second, you have probably read somewhere that double values have a precision of roughly 17 digits in the significand, while you are trying to show 21 digits. You cannot get correct data past the ...508, because you are trying to push the result beyond that 17-digit limit.

The reason you get different results on different computers is that the digits shown beyond the precise ones are essentially meaningless noise, so it is normal to get different values there (you could even get different values on different runs on the same machine with the same program).
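A quick way to see this is to print the same double with increasing precision; the digits past roughly the 17th tell you nothing more about the computed value (a sketch, using the value from the question):

#include <stdio.h>

int main( void )
{
    double v = 40965.8966304650828827e-01 ;

    printf( "%.15g\n", v ) ;   /* 15 significant digits                              */
    printf( "%.17g\n", v ) ;   /* 17 digits: enough to uniquely identify this double */
    printf( "%.21g\n", v ) ;   /* 21 digits: the trailing digits are only an
                                  artifact of the binary-to-decimal conversion       */
    return 0 ;
}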

Luis Colorado
  • 1. Yes, that was my mistake, but this is a real problematic value from the test results. – armoken Dec 03 '22 at 21:25
  • 2. Yes, but accumulation of error over multiple computations can result in large errors. And we need to know what can affect the results, because significant resources would be required now to make the current code generate results fully independent of the hardware. – armoken Dec 03 '22 at 21:38
  • 1. Then you have a mistake in your code: the error can grow, but not that much. 2. Your accumulated error can become huge if you subtract two quantities of about the same magnitude (e.g. if you subtract `1.1235645 - 1.1235558`), because the relative error can grow up to ~100%, and this can be amplified if you multiply that result by a large number. But normally accumulated errors tend to compensate in pairs and keep the relative error about the same. Look for a book on error theory. You need it. – Luis Colorado Dec 05 '22 at 06:53