
What checks can I perform to identify what differences there are in the floating point behaviour of two hardware platforms?

Verifying IEEE-754 compliance or checking for known bugs may be sufficient (to explain a difference in output that I've observed).

I have looked at the CPU flags via /proc/cpuinfo and both claim to support SSE2. I looked at some existing test suites, but they look challenging to use. I've built TestFloat but I'm not sure what to do with it. The home page says:

"Unfortunately, TestFloat’s output is not easily interpreted. Detailed knowledge of the IEEE Standard is required to use TestFloat responsibly."

Ideally I just want one or two programs or some simple configure-style checks I can run and compare the output between the two platforms.

Ideally I would then convert this into configure checks so that an attempt to compile the non-portable code on a platform that behaves abnormally is detected at configure time rather than at run time.

Background

I have found a difference in behaviour for a C++ application on two different platforms:

  • Intel(R) Xeon(R) CPU E5504
  • Intel(R) Core(TM) i5-3470 CPU

Code compiled natively on either machine runs on the other, but for one test the behaviour depends on which machine the code is run on.

**Clarification**: The executable compiled on machine A behaves like the executable compiled on machine B when copied to run on machine B, and vice versa.

It could be an uninitialised variable (though nothing showed up in valgrind) or many other things, but I suspected that the cause could be non-portable use of floating point. Perhaps one machine is interpreting the floating point assembly differently from the other? The implementers have confirmed they know about this. It's not my code and I have no desire to completely rewrite it to test this. Recompiling is fine though. I want to test my hypothesis.

In the related question I am looking at how to enable software floating point. This question is tackling the problem from the other side.

Update

I've gone down the configure check road and tried the following, based on @chux's hints.

#include <iostream>
#include <cfloat>

int main(int /*argc*/, const char* /*argv*/[])
{
   std::cout << "FLT_EVAL_METHOD=" << FLT_EVAL_METHOD << "\n";
   std::cout << "FLT_ROUNDS=" << FLT_ROUNDS << "\n";
#ifdef __STDC_IEC_559__
   std::cout << "__STDC_IEC_559__ is defined\n";
#endif
#ifdef __GCC_IEC_559__
   std::cout << "__GCC_IEC_559__ is defined\n";
#endif
   std::cout << "FLT_MIN=" << FLT_MIN << "\n";
   std::cout << "FLT_MAX=" << FLT_MAX << "\n";
   std::cout << "FLT_EPSILON=" << FLT_EPSILON << "\n";
   std::cout << "FLT_RADIX=" << FLT_RADIX << "\n";
   return 0;
}

Giving identical output on both platforms:

./floattest 
FLT_EVAL_METHOD=0
FLT_ROUNDS=1
__STDC_IEC_559__ is defined
FLT_MIN=1.17549e-38
FLT_MAX=3.40282e+38
FLT_EPSILON=1.19209e-07
FLT_RADIX=2

I'm still looking for something that might be different.
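
Following @chux's further hints in the comments, a broader sweep of the characteristics from "B.6 Characteristics of floating types", printed with enough digits that nothing hides in the formatting, might look like this (a sketch; the TRUE_MIN/HAS_SUBNORM macros are guarded because they are only required by newer standards):

#include <cfloat>
#include <iomanip>
#include <iostream>

int main(int /*argc*/, const char* /*argv*/[])
{
   // DECIMAL_DIG digits round-trip a long double, so no difference
   // can hide in the default 6-digit formatting.
   std::cout << std::setprecision(DECIMAL_DIG);

   std::cout << "DECIMAL_DIG=" << DECIMAL_DIG << "\n";
   std::cout << "DBL_EPSILON=" << DBL_EPSILON << "\n";
   std::cout << "LDBL_MANT_DIG=" << LDBL_MANT_DIG << "\n";
   std::cout << "LDBL_EPSILON=" << LDBL_EPSILON << "\n";
   std::cout << "LDBL_MIN=" << LDBL_MIN << "\n";
   std::cout << "LDBL_MAX=" << LDBL_MAX << "\n";
#ifdef DBL_TRUE_MIN   // C11/C++17: smallest sub-normal
   std::cout << "FLT_TRUE_MIN=" << FLT_TRUE_MIN << "\n";
   std::cout << "DBL_TRUE_MIN=" << DBL_TRUE_MIN << "\n";
   std::cout << "LDBL_TRUE_MIN=" << LDBL_TRUE_MIN << "\n";
#endif
#ifdef DBL_HAS_SUBNORM   // C11/C++17: sub-normal support flag
   std::cout << "FLT_HAS_SUBNORM=" << FLT_HAS_SUBNORM << "\n";
   std::cout << "DBL_HAS_SUBNORM=" << DBL_HAS_SUBNORM << "\n";
   std::cout << "LDBL_HAS_SUBNORM=" << LDBL_HAS_SUBNORM << "\n";
#endif
   return 0;
}

The comments also point at the math library as a suspect, so a second sketch prints a few libm results in hex format, where any last-bit difference is unambiguous. IEEE-754 does not require correct rounding for these functions, so even two compliant platforms may legitimately disagree:

#include <cmath>
#include <cstdio>

int main()
{
   volatile double x = 0.5;  // volatile stops compile-time folding
   std::printf("sin(x) = %a\n", std::sin(x));
   std::printf("exp(x) = %a\n", std::exp(x));
   std::printf("pow(x, 0.3) = %a\n", std::pow(x, 0.3));
   return 0;
}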

Bruce Adams
    Verifying IEEE754 compliance is **not** sufficient, as two compliant implementations may still give different results. That's one of the reasons why there is such a warning on TestFloat; IEEE754 is incredibly complex to master. – MSalters Mar 22 '19 at 14:13
  • I said *might*. What I mean is if a non-compliance or well know hardware bug is reported on my machine it might explain the difference in behaviour I'm seeing. I've added a clarification just in case. – Bruce Adams Mar 22 '19 at 14:34
  • That's why I added as a comment. Your "Verifying IEEE754 compliance _may_ be sufficient" (to find differences) indicates you're not sure. And indeed, your observed difference could be a difference between two compliant IEEE754 implementations. To name just a trivial example, the rounding of `sin(1.0)` is NOT specified by IEEE754. – MSalters Mar 22 '19 at 14:41
  • I have another wacky idea relating to this see [emulator-to-run-an-application-as-if-its-on-a-different-cpu](https://softwarerecs.stackexchange.com/questions/56933/emulator-to-run-an-application-as-if-its-on-a-different-cpu) – Bruce Adams Mar 22 '19 at 15:06
  • `long double` is more often different between platforms than `double`/`float`, and `float` is more often different than `double`, so checking `long double` characteristics is more likely to uncover issues than `float`. Unexpected sub-normals can sometimes be found by examining `xxx_TRUE_MIN`. Support for various rounding modes sometimes varies. See `xxx_HAS_SUBNORM` and others in C spec "B.6 Characteristics of floating types". Might as well check them all. – chux - Reinstate Monica Mar 22 '19 at 18:14
  • Interestingly, the detection of variations (in the FP world) is akin to my general [variation detection](https://codereview.stackexchange.com/q/215113/29485) attempt of C. – chux - Reinstate Monica Mar 22 '19 at 18:16
  • A detail on printing FP values to look for small differences: rather than 6 decimal places after the `.`, it would be better to show all significant digits; that is more like 9 and 17 for `float` and `double`. See `xxx_DECIMAL_DIG`. [ref](https://stackoverflow.com/q/16839658/2410359) Or use hex format output. – chux - Reinstate Monica Mar 22 '19 at 18:24
  • Your clarification in the question confuses me. Maybe it's my English skills, but that sentence seems to say that everything is the same. So, what differs exactly? Can you describe exactly how you produce the executables and what copying you do? – geza Mar 22 '19 at 19:40
  • @Bruce Adams Have you run the application with `valgrind` to ensure there is no access to uninitialized or out-of-bounds data? Did you compile the code for strictest supported adherence to floating-point standards? For example, with my Intel compiler version 13, I would specify `/fp:strict`; I believe newer Intel compilers offer even more stringent settings for floating-point computation. Note that compiling with strict FP settings *may* lead to a noticeable drop in performance. Are both of your machines using the same math library version? – njuffa Mar 22 '19 at 21:03
  • @njuffa nothing showed up in valgrind. I'm using gcc (4.8) I've played a little with the compiler options but I'm not sure what to tweak. I'm not worried about performance for this test. I've tried running in docker to rule out operating system and library differences. – Bruce Adams Mar 22 '19 at 21:16
  • @geza There are two outputs X and Y. If I run the program on machine A I get output X. If I run the program on machine B I get output Y. It doesn't matter which machine the program was compiled on. I copy the executable and the shared libraries it uses that are not supplied by the OS vendor. This includes the one that has the portability issue. – Bruce Adams Mar 22 '19 at 21:20
  • @BruceAdams: is it 32 or 64-bit compilation? If 32, FPU or SSE? Do you use Direct3D (D3D could set FPU precision to 32-bit). Or maybe one part of the program branches on CPU capabilities. For example, if CPU supports FMA, then it uses it (it could cause a difference). There are a lot of possibilities where the difference could occur :) – geza Mar 22 '19 at 21:39
  • @geza - 64-bit, SSE. Linux, no Direct3D or GPU usage. – Bruce Adams Mar 22 '19 at 21:50
  • @BruceAdams: I have only one suspect: some math routine gives a different result. Maybe because they changed the implementation (I suppose that you don't have the exact same libc, libm versions), or maybe because of FMA. It's hard to tell without access to the machines. Note that it is almost impossible that IEEE-754 compliance is the problem. As far as I know, Intel SSE is IEEE-754 compliant. Results of basic operations (+, -, *, /, sqrt) have to be the same on all machines (and "the same" means exact here: calculating the result in infinite precision, then rounding to the output precision). – geza Mar 22 '19 at 23:19
  • those x86 CPUs are still in the "same platform" so obviously the results should be the same. If there are any differences in floating-point behavior then it should be noted in Intel's documentation – phuclv Mar 23 '19 at 02:26

2 Answers


OP has 2 goals that conflict a bit.

  1. How to detect differences in floating point behaviour across platforms (?)

  2. I just want one or two programs or some simple configure style checks I can run and compare the output between two platforms.

Yes, some differences are easy to detect, but others can be exceedingly subtle.
Sample: *Can the floating-point status flag FE_UNDERFLOW be set when the result is not sub-normal?*

There are no simple tests for the general problem.
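
As an example of how subtle it gets, a probe for that underflow question might look like this (a sketch; GCC ignores #pragma STDC FENV_ACCESS, so compile with optimization off to keep the flag accesses honest):

#include <cfenv>
#include <cfloat>
#include <cstdio>

int main()
{
    std::feclearexcept(FE_ALL_EXCEPT);

    // The exact product is just below DBL_MIN (tiny before rounding)
    // yet rounds up to DBL_MIN (not tiny after rounding). Whether
    // FE_UNDERFLOW is raised depends on how tininess is detected.
    volatile double x = DBL_MIN;
    volatile double y = 1.0 - DBL_EPSILON / 2;  // exactly 1 - 2^-53
    volatile double r = x * y;

    std::printf("r == DBL_MIN : %d\n", r == DBL_MIN);
    std::printf("FE_UNDERFLOW : %d\n", !!std::fetestexcept(FE_UNDERFLOW));
    std::printf("FE_INEXACT   : %d\n", !!std::fetestexcept(FE_INEXACT));
    return 0;
}

Both answers are IEEE-754 compliant: the standard leaves the choice of before-rounding or after-rounding tininess detection to the implementation.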

Recommend one of the following:

  1. Revamp the coding goal to allow for nominal differences.

  2. See if __STDC_IEC_559__ is defined and hope that is sufficient for your application. Given various other factors like FLT_EVAL_METHOD, FLT_ROUNDS and optimization levels, code can still be compliant yet produce different results, but the degree of difference will be more manageable (see the sketch after this list).

  3. If super high consistency is needed, do not use floating point.
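
A minimal configure-style translation of option 2 might look like this (a sketch; whether __STDC_IEC_559__ is defined is up to the implementation, and FLT_ROUNDS is left out because it need not be a preprocessor-time constant):

#include <cfloat>

/* Refuse to build on a platform that does not promise
   IEC 559 (IEEE-754) semantics. */
#ifndef __STDC_IEC_559__
#error "no IEC 559 (IEEE-754) guarantee from this implementation"
#endif

/* Refuse excess-precision evaluation of intermediate results. */
#if FLT_EVAL_METHOD != 0
#error "intermediates may be evaluated in a wider format"
#endif

int main() { return 0; }

Wrapped in something like autoconf's AC_COMPILE_IFELSE, the failure then happens at configure time, which is what OP asked for.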

chux - Reinstate Monica
  • I've done a very basic check based on the things you've suggested so far and updated the question with the results. Rewriting the code is not an option for me at the moment. It is not 'my' code for a start. I would have used integers because high consistency is desirable to me. – Bruce Adams Mar 22 '19 at 17:36
  • See my clarification: the executable compiled on machine A behaves like the executable compiled on machine B when copied to run on machine B, and vice versa. Some posts suggest "The same native assembly code is most likely deterministic provided you're careful with floating point flags and compiler settings." Are there any obvious cases where this is not the case? – Bruce Adams Mar 22 '19 at 18:04

I found a program called esparanoia that does some checks of floating point behaviour. It is based on William Kahan's original paranoia program, which found the infamous Pentium division bug.

While it did not detect any problems with my test systems (and thus is not sufficient to answer the question), it might be of interest to someone else.

Bruce Adams