
Problem

We've got multiple machines running Ubuntu with very similar specs. We ran a simple program to verify an issue we are seeing in the Windows VM that each of these machines runs. The program was compiled with gcc 4.8.4 on a 64-bit Linux machine and with the v140 toolset in Visual Studio on a 64-bit Windows VM.

#include <cmath>
#include <stdio.h>

int main()
{
  double num = 1.56497856262158219209;
  double numHalf = num / 2.0;

  double cosVal = cos(num);
  double cosValHalf = cos(numHalf);

  printf("num = %a\n", num);
  printf("numHalf = %af\n", numHalf);
  printf("cosVal(num) = %a\n", cosVal);
  printf("cosValHalf(numHalf) = %a\n", cosValHalf);

  //system("pause");
  return 0;
}

The issue arises when running the same binary file on host machines with certain CPUs.

Results

On Linux, all machines produce the same output. On the Windows VMs, different results are produced even though the VM versions and settings are identical. Additionally, binaries generated in one VM produce different results when moved to a different host machine; e.g. a binary generated in VM2 but executed in the VM on Linux machine 1 (LM1) returned the same results as if VM1 had generated the binary. We even cloned the VM to confirm this behavior, and it persisted.

Given the efforts described above, I'm thinking it's not a library difference or a VM issue. As for the outputs, the following CPUs produce these results:

  • Intel® Xeon(R) CPU E5-2630 0
  • Intel® Xeon(R) CPU E5-2630 v2

The CPUs above produce uniform results between Linux and Windows. The results are printed in hex (%a) because readability mattered less than whether there was a discrepancy.

num = 0x1.90a26f616699cp+0
numHalf = 0x1.90a26f616699cp-1
cosVal(num) = 0x1.7d4555e817bdcp-8
cosValHalf(numHalf) = 0x1.6b171bb5e3434p-1

The following CPUs produce different results on the Windows VM than on the Linux host:

  • Intel® Xeon(R) CPU E5-2630 v3
  • Intel® Xeon(R) CPU E3-1270 v5

I'm not sure how these results are being produced. The disassembly in VS2015 shows that both programs contain the same instructions regardless of which host machine they were compiled on. Note that the output below differs from the output above only in the last bit of cosValHalf, i.e. by one unit in the last place.

num = 0x1.90a26f616699cp+0
numHalf = 0x1.90a26f616699cp-1
cosVal(num) = 0x1.7d4555e817bdcp-8
cosValHalf(numHalf) = 0x1.6b171bb5e3435p-1

Question

Why would Windows in a VM handle the same binary differently when the VM runs on a host machine with a specific CPU?

Looking at the differences between, for example, the E5-2630 v2 and the E5-2630 v3, it appears the CPUs producing different results support the AVX2, F16C, and FMA3 instruction sets, whereas the former CPUs do not. However, if that were the reason for the discrepancy, I would expect the same divergence to show up between the Linux machines as well. Also, the disassembly showed the registers used were the same on either chip. Having debugged the file and stepped through each instruction, you would think the behavior would be identical.
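For what it's worth, here is a minimal standalone sketch (not part of the original question; the values are chosen purely for illustration) of why FMA availability can change results: a fused multiply-add skips the intermediate rounding of the product, so a*b + c computed with separate multiply and add instructions can differ from the fused version.

#include <cmath>
#include <stdio.h>

int main()
{
  double eps = std::ldexp(1.0, -27);   // 2^-27
  double a = 1.0 + eps;                // chosen so that a*b is not
  double b = 1.0 + eps;                // exactly representable as a double
  double c = -1.0;

  volatile double prod = a * b;        // product rounded to double here...
  double separate = prod + c;          // ...and the sum rounded again
  double fused = std::fma(a, b, c);    // multiply and add, rounded once

  printf("separate = %a\n", separate); // 0x1p-26
  printf("fused    = %a\n", fused);    // 0x1.0000001p-26
  return 0;
}

If a compiler or library routine uses FMA on one CPU and separate multiply/add on another, this is exactly the kind of last-bit discrepancy that falls out.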

Summing all this up, it's probably the difference in architecture. Any thoughts on how I can be sure?

Resources

I've found the following questions somewhat useful regarding solutions that promote cross-platform consistency and make results more deterministic. I also took a long walk through floating-point comparison and cannot recommend it enough for anyone curious about the topic.
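As a small illustration of the bit-level comparison those resources discuss (not part of the original question): the two cosValHalf results above differ by exactly one unit in the last place, which is easy to verify by reinterpreting the doubles as integers.

#include <cstdint>
#include <cstring>
#include <stdio.h>

// ULP distance for finite doubles of the same sign: adjacent doubles
// have adjacent integer bit patterns.
static long long ulp_distance(double a, double b)
{
  std::int64_t ia, ib;
  std::memcpy(&ia, &a, sizeof ia);
  std::memcpy(&ib, &b, sizeof ib);
  return ia > ib ? ia - ib : ib - ia;
}

int main()
{
  // The two cosValHalf values from the results above (hex float
  // literals: standard in C++17, a common extension before that).
  double linuxVal = 0x1.6b171bb5e3434p-1;
  double winVal = 0x1.6b171bb5e3435p-1;
  printf("ulp distance = %lld\n", ulp_distance(linuxVal, winVal)); // 1
  return 0;
}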

Juno
  • It's because you used different compilers, not so much because of Windows per se. Try compiling with MinGW on Windows, with the same code-gen options as on Linux. See also https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/; also, older 32-bit Visual C++ and DirectX libraries apparently used to mess with the x87 precision settings for slightly faster FP divide: https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/ – Peter Cordes Sep 26 '18 at 19:43
  • Yes, FMA does remove one temporary rounding, so you're probably seeing compilers take advantage of it when available if you're getting different results. – Peter Cordes Sep 26 '18 at 19:45
  • Related: [Is SSE floating-point arithmetic reproducible?](https://stackoverflow.com/q/15147174) (different CPUs will run the same asm the same way). But getting identical asm from different compilers is the problem: [Does any floating point-intensive code produce bit-exact results in any x86-based architecture?](https://stackoverflow.com/q/27149894) – Peter Cordes Sep 26 '18 at 19:48
  • "It's because you used different compilers...different CPUs will run the same asm the same way" This is not the behavior I am experiencing. A binary created on one CPU (E5-2630 v2) behaves differently than when the same, unmodified binary is ran on a different CPU (E5-2630 v3). – Juno Sep 26 '18 at 20:39
  • Yup. Are you sure none of the compilers were doing dynamic dispatching to a different version of the function for CPUs with FMA3? Maybe set a breakpoint on the disassembly you were looking at, and see if it's actually reached. – Peter Cordes Sep 26 '18 at 20:41
  • Edited my initial response. However, responding to the dynamic dispatching check - I don't imagine calling `cos(x)` alone would cause that. After stepping through each assembly instruction, I can see that each instruction is getting called. – Juno Sep 26 '18 at 21:13
  • Runtime dispatch can happen at dynamic link time. glibc already does that for `memset` / `memcpy` and other functions, because once the optimal version is resolved, there's no extra overhead; library calls are already an indirect call. `libm` *could* (but probably doesn't) come with a `-march=haswell` version of `cos` that uses FMA. But maybe MSVC does that? – Peter Cordes Sep 26 '18 at 23:00
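For what it's worth, here is a minimal GCC-specific sketch (not from the discussion; the function names and the polynomial are made up for illustration) of the kind of per-CPU dispatch being described: two variants of the same routine, with the FMA variant selected at startup only when the CPU supports it.

#include <cmath>
#include <stdio.h>

// Baseline variant: separate multiply and add, two roundings per step.
static double poly_baseline(double x)
{
  return (0.25 * x + 0.5) * x + 1.0;
}

// FMA variant: built for an FMA-capable target, one rounding per step.
__attribute__((target("fma")))
static double poly_fma(double x)
{
  return std::fma(std::fma(0.25, x, 0.5), x, 1.0);
}

int main()
{
  // glibc resolves choices like this once, at dynamic link time (via
  // ifuncs); a function pointer picked at startup is the same idea.
  double (*poly)(double) =
      __builtin_cpu_supports("fma") ? poly_fma : poly_baseline;

  printf("poly = %a\n", poly(1.56497856262158219209));
  return 0;
}

If something in the toolchain or runtime does this, the breakpoint test suggested above would reveal a different code path being taken on the FMA-capable hosts.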

1 Answer


You can compile your program as an ELF binary on Linux and then run it on Linux. You can then copy that ELF binary onto your Windows system and run it under the Windows Subsystem for Linux. The FP initialization should be the same for both systems. Now you are running the same floating point instructions on both systems and the floating point results should be the same. If they aren't (which is unlikely), it is because of differing initializations.
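As a quick check of that assumption, a minimal <cfenv> sketch (not part of the original answer) can be compiled once and run on each system to compare the initial rounding mode; note it does not cover x86-specific MXCSR bits such as flush-to-zero.

#include <cfenv>
#include <stdio.h>

int main()
{
  // Report the rounding mode the FP environment starts in.
  switch (std::fegetround()) {
    case FE_TONEAREST:  printf("rounding: to nearest\n");  break;
    case FE_UPWARD:     printf("rounding: upward\n");      break;
    case FE_DOWNWARD:   printf("rounding: downward\n");    break;
    case FE_TOWARDZERO: printf("rounding: toward zero\n"); break;
    default:            printf("rounding: unknown\n");     break;
  }
  return 0;
}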

You can also run this ELF binary on different architectures and systems (FreeBSD, ...). The results should all be the same. At that point you can rule out architecture+microarchitecture and rule in Windows and Linux compiler+runtime differences.

You may also be able to use Visual Studio to compile to an ELF binary and repeat this for the different systems and architectures. Those results should be the same but possibly different from the Linux GCC/Clang ELF.

Olsonist