
The following C code was compiled today on two systems with Microsoft's compiler (installed with Visual Studio 2017 Community), both of which had modern 64-bit Intel processors and were running Windows 10:

#include <math.h>
#include <stdio.h>

int main() {
    int a = 64, b = 2;
    double logA = log(a);
    double logB = log(b);
    double c = logA / logB;
    printf("logA: %0.20lf\n", logA);
    printf("logB: %0.20lf\n", logB);
    printf("c: %0.20lf\n", c);
    printf("result: %d\n", ((int)c));
    return 0;
}

Let's call them "system A" and "system B" to keep things clear. Both systems printed exactly the same values for logA and logB:

logA: 4.15888308335967149532
logB: 0.69314718055994528623

However for c and result system A printed:

c: 6.00000000000000000000
result: 6

...And system B printed:

c: 5.999999999999...
result: 5

(I no longer have the exact result it produced, but suffice it to say it was ever so slightly smaller than 6.0.)

I tried compiling the same code on several other systems, including an older system running Windows 7 with VS Community 2013, as well as a Mac running macOS 10.12 and Xcode 9.3.1, and their results agreed with system A, not system B.

Here's where it gets really strange:

To confound things even further, when the program was compiled two weeks ago on system B, it produced the same result as system A! Something changed in the interim, causing the compiled program to produce a different answer.

My question is: how on earth did that happen?!

I'm aware of floating point inaccuracy and am not trying to justify or excuse the mistake of dividing the result of log(a) by the result of log(b) and truncating, without rounding or otherwise accounting for inaccuracy. That's clearly a bug.
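To make it concrete, here's a sketch of the kind of fix I mean (not the project's actual code; the function names are just for illustration): round the quotient to the nearest integer before converting, or avoid floating point entirely.

#include <math.h>
#include <stdio.h>

/* Sketch only: round the quotient to the nearest integer instead of
   truncating it. The exact answer is an integer when n is a power of two,
   so rounding absorbs a result like 5.999999999999999. */
static int ilog2_rounded(int n) {
    return (int)(log((double)n) / log(2.0) + 0.5);
}

/* Alternative sketch: avoid floating point entirely by counting how many
   times n can be halved before it reaches 1. */
static int ilog2_bits(unsigned int n) {
    int levels = 0;
    while (n >>= 1)
        ++levels;
    return levels;
}

int main() {
    printf("rounded: %d\n", ilog2_rounded(64));  /* expected: 6 */
    printf("bits:    %d\n", ilog2_bits(64));     /* 6, independent of the FPU */
    return 0;
}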

But this particular bug only revealed itself in the last two weeks, and only on one system. I wasn't able to reproduce it anywhere else.

As far as I know, the result of double-precision floating point math is determined by the CPU, not the operating system, the compiler, the C runtime, or any other software.

Why might a system that produced 6.0000000000 two weeks ago suddenly switch to producing 5.9999999999?

(Granted Windows 10 likes to automatically update itself silently, but the owner of system B did not manually install any updates that could easily explain this change.)

For both my own education and peace of mind I'd really love to know the answer to this.
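(One more note: if the discrepancy shows up again I'll capture the exact value instead of a rounded decimal. Below is a minimal sketch of how I'd do that, printing the hex-float form and the raw bit pattern; it assumes the C runtime's printf supports C99's %a, which I believe the VS 2017 CRT does.)

#include <math.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main() {
    double logA = log(64.0);
    double logB = log(2.0);
    double c = logA / logB;
    uint64_t bits;
    memcpy(&bits, &c, sizeof bits);       /* capture the exact bit pattern of c */
    printf("c (hex float): %a\n", c);     /* exact representation, unlike %0.20lf */
    printf("c (raw bits):  0x%016llX\n", (unsigned long long)bits);
    return 0;
}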

  • Did the system *consistently* produce a different result, or did it just happen one time? My best guess is 80-bit FPU registers getting flushed to 64-bit during a context switch, but that's really just a shot in the dark. – Daniel Pryden Sep 15 '18 at 23:52
  • System B consistently produced a different result from system A. It was not a one-time thing. – Bri Bri Sep 15 '18 at 23:53
  • Did you update VS to 15.8.4 in the interim? (theoretically that shouldn't matter, but there is no telling what all changed behind the scenes) – David C. Rankin Sep 15 '18 at 23:59
  • I guess this will turn on what "modern 64-bit processor" you have. I have just, for curiosity, compiled the code on VS13, VS17(15.8.4), gcc MinGW (gcc 6.3) on windows and gcc 8.2.1 on Linux (on i7 and Core2 intel and on AMD) and all report your 'system A' results. I cannot produce 'system B' results. What processor do you have for 'system B'? – David C. Rankin Sep 16 '18 at 00:18
  • The hypothesis in the question that “the result of double-precision floating point math is based on the CPU, not the operating system, the compiler, the C runtime, or any other software” is incorrect. Intel processors provide some floating-point assist functions, but they are limited, and the standard library math routines are implemented in software supplied with the compiler and/or the operating system. Microsoft’s routines have been notorious for some inaccuracy in the past. – Eric Postpischil Sep 16 '18 at 00:20
  • Potential sources of variation between the behavior two weeks ago and today include: Misremembered source, changed compilation options, changed link or build options (if they changed a dynamic library used), system updates (may have provided a new dynamic library). The C standard permits `double c = log(a)/log(b)` to yield a different result than `double logA = log(a); double logB = log(b); double c = logA/logB;` because it permits the former to be computed with greater precision than the `double` format nominally has. So different source for the “same” calculations can have different results. – Eric Postpischil Sep 16 '18 at 00:22
  • I was able to reproduce both cases using VS 2015. Passing `/fp:fast` to the compiler produces 5.999... But passing `/fp:strict` or `/fp:precise` produces 6.00. Check which option you are using. Different options use different x86 FP instructions, and so the results are different. – Hadi Brais Sep 16 '18 at 01:13
  • It seems that with `/fp:fast`, the compiler is calling `log` only once and uses `mulsd` to compute an approximation for the second `log`. On the other hand, `/fp:strict` or `/fp:precise` emits both calls to `log` and uses `divsd` to calculate `c`. – Hadi Brais Sep 16 '18 at 01:19
  • The default is `/fp:precise`. It appears to me that you were using `/fp:fast`. But then you changed this to `/fp:precise`. – Hadi Brais Sep 16 '18 at 01:21
  • [log and other transcendental functions aren't required to be properly rounded](https://stackoverflow.com/a/31509332/995714). And to calculate the length of a number there are better ways: [`_BitScanForward64`](https://msdn.microsoft.com/en-us/library/wfd9z0bb.aspx?f=255&MSPPError=-2147217396), which translates to a single instruction, instead of hundreds or thousands to calculate the logarithm. See also [counting bits or find the right|left most ones](https://stackoverflow.com/q/9093323/995714), http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious – phuclv Sep 16 '18 at 01:47
  • @EricPostpischil It's not the log function that's producing different results, but rather floating point division. That's handled by the CPU, is it not? – Bri Bri Sep 16 '18 at 03:06
  • @HadiBrais The different `/fp` options are an interesting clue. However, as of now I don't know how the build options would have changed. This software is built with a custom build script that has not been modified over those two weeks. But it's certainly worth investigating since you were able to replicate both results. – Bri Bri Sep 16 '18 at 03:07
  • @DanielPryden: Windows context switches save/restore the full 80-bit state of the x87 regs, I assume using `xsave` or `xsaveopt`. Interesting hypothesis, but no mainstream OS would introduce FP indeterminacy by only saving part of the x87 state with `double` instead of 80-bit. Also, that would break any MMX code that had the FPU in MMX state. Also, the OP describes using x86-64 Windows. But they didn't say they actually set their compiler to make 64-bit executables on all systems, which would use SSE2 instead of x87 for `double`. – Peter Cordes Sep 16 '18 at 03:48
  • @GuyGizmo: I see nothing that points to division as the problem, other than Hadi Brais’ note that the compiler may be multiplying by the inverse rather than dividing. – Eric Postpischil Sep 16 '18 at 10:27
  • I voted to close as too broad (although unclear would work as well) since OP does not have the information necessary to reproduce the problem, notably the exact state of the system two weeks ago when the problem occurred. Various causes of inaccuracies in floating-point computations that arise in situations like this are well known to practitioners and are espoused in other Stack Overflow questions, and it is unlikely anything surprising to experienced practitioners occurred here. So there is likely nothing to learn, and likely no specific solution to be given without the missing information. – Eric Postpischil Sep 16 '18 at 10:32
  • I own system B and I wrote the offending line of code (five years ago) (not what @GuyGizmo posted- that's debugging code.) The line was: ```int num_levels=(int)(log(num_points/LINES_IN_BOX)/log(2))+1;``` That's embarrassingly bad code and I have little to say in my own defense. Well, we all write a clunker now and then. It's part of a novel algorithm for repeatedly finding closest points on a piecewise Bezier that is an order of magnitude faster than anything I could find in the literature five years ago. I think I might have been focusing on the algorithm more than the code. – Tagore Smith Sep 16 '18 at 19:45
  • System B has an i7-8700k, not that I think that's very relevant. The code did not change (I know because the first thing I did when this bug surfaced was revert to a known good state in git- but this code hasn't changed in years.) The build script is also under version control, so it's not that. I don't use system B for most development- I use a Macbook. But this bit of software calls for a beefy system, and its users will mainly run Windows. I didn't do anything on this system aside from a bit of gaming and web-browsing between working and non-working. I suspect MS updated something. – Tagore Smith Sep 16 '18 at 20:11
  • That said, I kind of agree with @EricPostpischil, which is why Guy posted this instead of me- I would not have. I wrote a bad line of code five years ago, it worked on many systems for many years, eventually it caught up with us and we lost a day or two to debugging it (it's deep in the bowels of a very complicated piece of software.) That sucks, but once we found the line... well, in a way I am glad this surfaced, though I regret the time we lost finding it and I regret writing it in the first place. I already know Windows and MS are whack. They probly updated things in my sleep. – Tagore Smith Sep 16 '18 at 20:23
