
We recently started seeing unit tests fail on our build machine (certain numerical calculations fell out of tolerance). Upon investigation we found that some of our developers could not reproduce the test failure. To cut a long story short, we eventually tracked the problem down to what appeared to be a rounding error, but that error was only occurring with x64 builds on the latest Haswell chips (to which our build server was recently upgraded). We narrowed it down and pulled out a single calculation from one of our tests:

#include "stdafx.h"
#include <cmath>

int _tmain(int argc, _TCHAR* argv[])
{
  double rate = 0.0021627412080263146;
  double T = 4.0246575342465754;
  double res = exp(-rate * T);
  printf("%0.20e\n", res);
  return 0;
}

When we compile this x64 in VS2013 (with the default compiler switches, including /fp:precise), it gives different results on the older Sandy Bridge chip and the newer Haswell chip. The difference is in the 15th significant digit, which I think is outside the machine epsilon for double on both machines.

If we compile the same code in VS2010 or VS2012 (or, incidentally, VS2013 x86) we get the exact same answer on both chips.

In the past several years, we've gone through many versions of Visual Studio and many different Intel chips for testing, and no-one can recall us ever having to adjust our regression test expectations based on different rounding errors between chips.

This obviously led to a game of whack-a-mole between developers with the older and newer hardware as to what should be the expectation for the tests...

Is there a compiler option in VS2013 that we need to be using to somehow mitigate the discrepancy?

Update:

Results on Sandy Bridge developer PC:
VS2010-compiled-x64: 9.91333479983898980000e-001
VS2012-compiled-x64: 9.91333479983898980000e-001
VS2013-compiled-x64: 9.91333479983898980000e-001

Results on Haswell build server:
VS2010-compiled-x64: 9.91333479983898980000e-001
VS2012-compiled-x64: 9.91333479983898980000e-001
VS2013-compiled-x64: 9.91333479983899090000e-001
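
For what it's worth, the two results look like adjacent doubles. A standalone snippet (plain standard C++, separate from the project above) that reinterprets the printed values as raw 64-bit patterns should report a gap of a single ulp, assuming the 20-digit output round-trips exactly:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret a double's bit pattern as an unsigned 64-bit integer; for
// finite positive doubles, adjacent representable values differ by exactly 1.
static std::uint64_t to_bits(double d)
{
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

int main()
{
    const double sandy_bridge = 9.91333479983898980000e-001; // SSE2 result
    const double haswell      = 9.91333479983899090000e-001; // FMA3 result
    std::printf("ulp distance: %llu\n",
                static_cast<unsigned long long>(to_bits(haswell) - to_bits(sandy_bridge)));
    return 0;
}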

Update:

I used procexp to capture the list of DLLs loaded into the test program.

Sandy Bridge developer PC:
apisetschema.dll
ConsoleApplication8.exe
kernel32.dll
KernelBase.dll
locale.nls
msvcr120.dll
ntdll.dll

Haswell build server:
ConsoleApplication8.exe
kernel32.dll
KernelBase.dll
locale.nls
msvcr120.dll
ntdll.dll
lesscode
  • What's the generated assembly? – T.C. Jan 12 '15 at 15:31
  • Just to clarify, this happens when you **run** the code on a Haswell? The same binary will behave differently? – ElderBug Jan 12 '15 at 15:34
  • @T.C. - My, what a good question. – SChepurin Jan 12 '15 at 15:34
  • Correct, this runs differently depending on the machine we deploy it to (if it was compiled for x64 in VS2013). – lesscode Jan 12 '15 at 15:41
  • @TC, I've run dumpbin to disassemble, but there's a lot of it. Is there anything specific I should be looking for (I'm not an assembler expert)? – lesscode Jan 12 '15 at 15:51
  • Well, that's pretty unlikely to have anything to do with the processor. You ought to be looking for an injected DLL that screws up the MXCSR register, changing the rounding mode. Similar [to this](http://stackoverflow.com/a/27636509/17034). Document the actual values you observe. – Hans Passant Jan 12 '15 at 16:07
  • That last answer is definitely wrong – leppie Jan 12 '15 at 16:27
  • Floating-point arithmetic is not usually deterministic, especially not the transcendental functions (the nigh unimplementable strictures of the latest IEEE-754 revision notwithstanding). Anyway, "/fp:precise" merely tries to generate accurate results. You want "/fp:strict" for an attempt at determinism. – doynax Jan 12 '15 at 17:09
  • Building with `/fp:strict` doesn't seem to have any effect on the outcome. – lesscode Jan 12 '15 at 17:20
  • The problem is that the algorithms for computing transcendental functions correctly rounded down to the last bit are prohibitively expensive. Therefore compiler vendors and CPU manufacturers use approximations, and ones which may differ between processor revisions and compiler releases. The `/fp:strict` switch disables the optimizer and makes basic arithmetic reproducible, but unless Microsoft decides to go out of their way to make it so, the `<cmath>` functions generally won't be. I suspect that your only real options are to use a larger epsilon (or to manually reimplement `exp()`.) – doynax Jan 13 '15 at 13:23
  • In the Visual C++ 2013 runtime libraries (msvcr120.dll and friends) for x64 we added new implementations of some of the most common transcendental functions, including exp, that take advantage of the FMA3 instruction set when available. These new functions may use different algorithms depending on instruction set availability. As @doynax notes, it's prohibitively expensive to produce correctly rounded results in many cases. In this particular case, the difference is 1ulp (and the base SSE2 implementation used on your Sandy Bridge is the correctly rounded result, I think). – James McNellis Jan 16 '15 at 04:44
  • (I don't have an FMA3-capable machine handy here at home; I will double-check the results you cite and provide a more formal answer tomorrow.) – James McNellis Jan 16 '15 at 04:47
  • @James, Thanks. So is there a way to get the 2013 runtime to behave the old way? – lesscode Jan 16 '15 at 04:48
  • You can disable use of the FMA3 library by calling `_set_FMA3_enable(0)` during process startup (ideally at the _very_ beginning of `main`, or in a global initializer), as sketched after these comments. While this will force use of the SSE2 implementations, results may still differ between major versions of the runtime library as we do continue to make changes to the math library to improve overall accuracy and performance and to fix bugs. (Do note that the new FMA3 implementations of these functions are much, much, much faster.) – James McNellis Jan 16 '15 at 04:53
  • Calling _set_FMA3_enable(0) in DllMain causes all of our numerical tests to pass on both CPUs. Thanks James. – lesscode Jan 16 '15 at 15:11
  • @James, I assume that calling _set_FMA3_enable once per process rather than once per thread is sufficient, right? ie on DLL_PROCESS_ATTACH and not on DLL_THREAD_ATTACH – lesscode Jan 16 '15 at 15:44
  • Is this issue fixed in some newer VS2013 Update (so that `_set_FMA3_enable(0);` is not needed any more)? – MrTux Dec 03 '15 at 23:48
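
Putting James McNellis's suggestion into the original repro gives a minimal sketch of the workaround; the `#if defined(_M_X64)` guard is my own addition, since `_set_FMA3_enable` only exists in the x64 flavour of the CRT:

#include "stdafx.h"
#include <cmath>   // _set_FMA3_enable is declared via <math.h> in the x64 CRT

int _tmain(int argc, _TCHAR* argv[])
{
#if defined(_M_X64)
    // Opt out of the FMA3-based transcendental implementations so that exp()
    // takes the same SSE2 code path on Sandy Bridge and Haswell alike.
    _set_FMA3_enable(0);
#endif

    double rate = 0.0021627412080263146;
    double T = 4.0246575342465754;
    double res = exp(-rate * T);
    printf("%0.20e\n", res);
    return 0;
}

For the DllMain case mentioned above, the same call would go under DLL_PROCESS_ATTACH; whether once per process is sufficient (rather than per thread) is the open question in the comments, so treat that as an assumption.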

2 Answers


The results you documented are affected by the value of the MXCSR register; the two bits that select the rounding mode are the ones that matter here. To get the "happy" number you like, you need to force the processor to round down. Like this:

#include "stdafx.h"
#include <cmath>
#include <float.h>

int _tmain(int argc, _TCHAR* argv[]) {
    unsigned prev;
    _controlfp_s(&prev, _RC_DOWN, _MCW_RC);
    double rate = 0.0021627412080263146;
    double T = 4.0246575342465754;
    double res = exp(-rate * T);
    printf("%0.20f\n", res);
    return 0;
}

Output: 0.99133347998389898000

Change _RC_DOWN to _RC_NEAR to put MXCSR in the normal rounding mode, the way the operating system programs it before it starts your program; that produces 0.99133347998389909000. In other words, your Haswell machines are in fact producing the expected value.

Exactly how this happened can be very hard to diagnose; the control register is the worst possible global variable you can think of. The usual cause is an injected DLL that reprograms the FPU. A debugger can show the loaded DLLs; compare the lists between the two machines to find a candidate.
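
To document the actual values, a minimal sketch (assuming the same VS2013 console-project setup as the repro above) that only reads the rounding-control bits; passing a mask of 0 to `_controlfp_s` retrieves the control word without changing it:

#include "stdafx.h"
#include <float.h>
#include <stdio.h>

int _tmain(int argc, _TCHAR* argv[]) {
    unsigned int cur = 0;
    _controlfp_s(&cur, 0, 0);          // mask of 0: query only, change nothing
    unsigned int rc = cur & _MCW_RC;   // isolate the rounding-control bits

    printf("rounding control: %s\n",
           rc == _RC_NEAR ? "_RC_NEAR (round to nearest, the OS default)" :
           rc == _RC_DOWN ? "_RC_DOWN (round toward -infinity)" :
           rc == _RC_UP   ? "_RC_UP (round toward +infinity)"
                          : "_RC_CHOP (truncate)");
    return 0;
}

Run it on both machines; if the values differ, an injected DLL is the likely culprit.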

Hans Passant
  • I updated the test as above to use _RC_NEAR, but the result is still 9.91333479983899090000e-001 on the Haswell machine. Is that what you'd expect? – lesscode Jan 12 '15 at 16:52
  • @Wayne I have a Haswell as well (ha!), and `_RC_DOWN` is the one needed to reproduce the expected behavior. – ElderBug Jan 12 '15 at 16:52
  • Yikes, I screwed that up. Rewritten. – Hans Passant Jan 12 '15 at 17:16
  • Perhaps the OP needs to invest in different build software ;p – leppie Jan 12 '15 at 17:22
  • @Hans, so if I understand you correctly, if I change the test program to force _RC_NEAR (the system default), I should get the same answer (9.91333479983899090000e-001) on both machines? I don't. I've checked the list of modules (main post updated). – lesscode Jan 12 '15 at 17:49
  • Also, if I write out the original value returned by _controlfp_s, it's _RC_DOWN in both cases. – lesscode Jan 12 '15 at 18:02
  • Sorry, it's actually zero (_RC_NEAR) in both cases, not _RC_DOWN – lesscode Jan 12 '15 at 18:10
  • Also, does any of this explain why the default behaviour is different in VS2012- and VS2010-compiled versions? – lesscode Jan 12 '15 at 20:12

This happens due to a bug in the MS VS2013 x64 CRT code that improperly detects AVX and FMA3 support.

Fixed in an updated 2013 runtime, or by using a newer MSVC version, or by simply disabling the feature support at runtime with a call to `_set_FMA3_enable(0);`.

See: https://support.microsoft.com/en-us/help/3174417/fix-programs-that-are-built-in-visual-c-2013-crash-with-illegal-instruction-exception

Sirmabus