
I am trying to write a simple log base 2 method. I understand that representing values like std::log(8.0) and std::log(2.0) exactly on a computer is difficult, and that std::log(8.0) / std::log(2.0) may therefore come out very slightly below 3.0. What I do not understand is why storing the result of the calculation below in a double (making it an lvalue) and then casting it to an unsigned int changes the result compared to casting the expression directly. The following test case fails repeatedly on my 32-bit Debian Wheezy machine, but passes repeatedly on my 64-bit Debian Wheezy machine.

#include <cmath>
#include "assert.h"

int main () {
  int n = 8;
  // Cast the quotient to unsigned int directly...
  unsigned int i =
    static_cast<unsigned int>(std::log(static_cast<double>(n)) /
                              std::log(static_cast<double>(2)));
  // ...versus storing it in a double first and then casting.
  double d =
    std::log(static_cast<double>(n)) / std::log(static_cast<double>(2));
  unsigned int j = static_cast<unsigned int>(d);
  assert (i == j);
}

I also know I can use bit shifting to compute my result in a more predictable way. I am mostly curious why casting the double that results from the operation is any different from storing that value in a double on the stack and then casting the double on the stack.
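For reference, the bit-shifting version I have in mind looks roughly like this (the helper name is just mine for this sketch; it returns floor(log2(n)) for n >= 1 using integer arithmetic only, so there are no rounding surprises):

// Halve n until it reaches zero and count the shifts.
unsigned int log2_by_shifting (unsigned int n) {
  unsigned int result = 0;
  while (n >>= 1)
    ++result;
  return result;
}

For n = 8 this returns 3 on both machines.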

  • The x87 uses 80-bit arithmetic internally. Lots of compilers, when targeting the x87, just pretend those extra bits aren't there. This sort of ugliness on x87 is a common result. Tell the compiler to use SSE instead of x87 with something like `-mfpmath=sse`. – tmyklebu Mar 30 '15 at 00:00
  • To dovetail with the remarks by @tmyklebu: tool chains for 32-bit x86 systems usually default to targeting the x87 FPU for floating-point computation, while tool chains for 64-bit systems usually default to the use of SSE. A combination of excess precision, double rounding, etc. then leads to the kind of effect you are observing here. – njuffa Mar 30 '15 at 00:40
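A small sketch for checking which evaluation mode a given build uses, assuming the compiler exposes the C99/C++11 `FLT_EVAL_METHOD` macro from `<cfloat>`:

#include <cfloat>
#include <iostream>

int main () {
  // FLT_EVAL_METHOD reports how intermediate results are evaluated:
  //   0 : each operation is rounded to its own type (typical of SSE)
  //   1 : float and double operations are evaluated as double
  //   2 : intermediates are kept at long double precision (typical of x87)
  std::cout << "FLT_EVAL_METHOD = " << FLT_EVAL_METHOD << '\n';
}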

1 Answer


In C++, floating point is allowed to do this sort of thing.

One possible explanation is that the result of the division is calculated internally, and kept in a register, at a higher precision than double.

Converting that value directly to unsigned int can give a different result from first converting it to double and then converting that to unsigned int.
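A sketch of the mechanism (not necessarily the exact code path your compiler generates), assuming a platform where long double is wider than double, such as the 80-bit x87 format; where long double is no wider than double, both conversions simply give 3:

#include <cmath>
#include <iostream>

int main () {
  // 2^-60 fits in an 80-bit significand, but is far smaller than the gap
  // between adjacent doubles near 3.0 (which is 2^-51).
  long double x = 3.0L - std::ldexp(1.0L, -60);

  unsigned int direct = static_cast<unsigned int>(x);  // truncates 2.999... to 2
  double d            = static_cast<double>(x);        // rounds up to exactly 3.0
  unsigned int stored = static_cast<unsigned int>(d);  // 3

  std::cout << direct << " vs " << stored << '\n';     // "2 vs 3" with 80-bit long double
}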

To see exactly what is going on, it might be helpful to look at the assembly output generated by your compiler for the 32-bit case.

Needless to say, you shouldn't write code that relies on exactness of floating point operations.
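If you do want to keep the floating-point log, one way to avoid depending on exactness is to treat the truncated quotient as an estimate and verify it against n itself. A rough sketch (the helper name is mine), assuming n >= 1:

#include <cmath>

unsigned int log2_corrected (unsigned int n) {
  unsigned int r =
    static_cast<unsigned int>(std::log(static_cast<double>(n)) /
                              std::log(2.0));
  // If rounding error left the quotient just below an integer, the
  // truncation came out one too small; bump it back up.
  if (r + 1 < 32 && (n >> (r + 1)) != 0)
    ++r;
  return r;
}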

M.M
  • On nicer compilers, storing the result as `long double` would yield the expected result, since that type would store 64 bits on machines which use 64-bit intermediate results, and 80 bits (and maybe some padding) on machines which use 80-bit intermediate results. Unfortunately, a lot of code using things like `printf` doesn't properly distinguish between the `%f` and `%lf` format specifiers, and will break if `long double` values are passed as 80 bits; MS decided the simplest fix was to have `long double` be 64 bits even on machines where intermediate computations are computed using 80 bits. – supercat Mar 31 '15 at 21:19
  • `%f` and `%lf` are defined as being the same. The long double one is `%Lf`. – M.M Mar 31 '15 at 21:32
  • My point remains that a lot of code which uses `long double` doesn't properly specify the format string, and some compiler vendors have decided the solution was to make `long double` be 64 bits everywhere even though that totally wrecks the semantics of that type. – supercat Mar 31 '15 at 22:22
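For reference, a minimal illustration of the format specifiers being discussed (for printf, `%f` and `%lf` both mean double, and `%Lf` means long double):

#include <cstdio>

int main () {
  double d       = 2.5;
  long double ld = 2.5L;
  std::printf("%f %lf\n", d, d);  // equivalent specifiers for double
  std::printf("%Lf\n", ld);       // long double needs the uppercase L modifier
}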