Blatant floating point error in C++ program

Question

I am assigning a double literal to a double variable, and the variable's value seems to get truncated. Otherwise I cannot understand why, for example, the difference diff is 0.0.

Sorry for the code duplication at setprecision but I am really pissed off.

#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>

int main()
{
    long double d = 1300010000000000000144.5700788999;
    long double d1 = 1300010000000000000000.0;
    long double diff = d - d1; // shall be 144.5700788999!!!
    long double d2 = 0.5700788999;

    std::cout << "d = " << std::fixed << std::setprecision(std::numeric_limits<long double>::digits10 + 1) << d << '\n';
    std::cout << "d1 = " << std::fixed << std::setprecision(std::numeric_limits<long double>::digits10 + 1) << d1 << '\n';
    std::cout << "d - d1 = " << std::fixed << std::setprecision(std::numeric_limits<long double>::digits10 + 1) << diff << '\n';
    std::cout << "d2 = " << std::fixed << std::setprecision(std::numeric_limits<long double>::digits10 + 1) << d2 << '\n';
}

This is the output:

d = 1300009999999999900000.0000000000000000
d1 = 1300009999999999900000.0000000000000000
d - d1 = 0.0000000000000000
d2 = 0.5700788999000001

I am expecting diff to be 144.5700788999 but it is 0.0

So, how to deal with it? (Windows 7 and higher, VS 2013)

...Should I use two doubles, one for the high part and one for the low part? That is, use d1 and d2 instead of d?

https://stackoverflow.com/questions/21557816/whats-the-c-suffix-for-long-double-literals i didn't check the standard for the details, but this might be useful. — Tiberiu Maran, May 30 '19 at 09:22
@HighPerformanceMark I really don't care if 'you keep losing count'. I expect a correct answer from the executable. Thank you, you failed to help me. I'll look for a library if that exists. — Emil Mocan, May 30 '19 at 09:52
@artcorpse Thank you, it seems does not help adding a suffix literal. — Emil Mocan, May 30 '19 at 09:55
`std::fixed` does not mean what you think it does, apparently. — Marc Glisse, May 30 '19 at 09:56
@MarcGlisse Actually, I have to parse a string and convert it to double. The code here is to make a point: I cannot represent exactly the literal '1300010000000000000144.5700788999' and expect correct math from it. — Emil Mocan, May 30 '19 at 10:02
Then your problem is that you have no idea what `double` is (hint: it isn't a BigFloat, of course you cannot represent all numbers), please read some documentation on that. — Marc Glisse, May 30 '19 at 10:04
@MarcGlisse I know it is not a BigFloat or something like that. But I find dumb that one can work either with very big number like 1300010000000000000144 or with a small number like 0.5700788999 but not with 1300010000000000000144.5700788999. I understood the limitations. I just find it really dumb. Just my 2 cents. — Emil Mocan, May 30 '19 at 10:07

score 2 · Accepted Answer · edited May 30 '19 at 13:15


80-bit long double (not sure about its size in MSVS) can store around 18 significant decimal digits without loss of precision. 1300010000000000000144.5700788999 has 32 significant decimal digits and cannot be stored exactly as long double.

Read Number of Digits Required For Round-Trip Conversions for more details.

Maxim Egorushkin · answered May 30 '19 at 09:26 · edited by Eric Postpischil May 30 '19 at 13:15

80 bit with visual studio, are you sure? – Marc Glisse May 30 '19 at 09:48
@MarcGlisse I don't have studio, on gcc x86_64 `long double` is 80-bit. – Maxim Egorushkin May 30 '19 at 09:50
@MaximEgorushkin That is an interesting read, for sure. I'll leave that for the weekend. Any quick solution/workaround? – Emil Mocan May 30 '19 at 09:56
@EmilMocan quick solution is to use a bigger type, say boost::mpfr_float (`mpfr_float::default_precision(1000);` to specify the size of the mantissa). Or if you are trying to do fixed point arithmetic, either use a type made for that, or some bigint. – Marc Glisse May 30 '19 at 10:02
@EmilMocan I cannot recommend anything for such big numbers, apart from using an arbitrary-precision math library. Or try using numbers that fit into `double`, `long double`. – Maxim Egorushkin May 30 '19 at 10:02
What Eric Postpischil said. How many decimal digits do you think `DBL_MAX` has? (A lot.) – HolyBlackCat May 30 '19 at 13:20
@HolyBlackCat You missed _without loss of precision_. Some longer numbers can be stored, but generally not. `double` can store, for example, `DBL_MAX`, but not `DBL_MAX - 1`. – Maxim Egorushkin May 30 '19 at 13:24
@MaximEgorushkin https://gcc.godbolt.org/ has MSVC, and it's easy to look up on MSVC documentation: [Built-in types](https://learn.microsoft.com/en-us/cpp/cpp/fundamental-types-cpp?view=msvc-160). [MSVC already uses IEEE-754 binary64 for `long double` long ago and never supports 80-bit `long double` for decades](https://stackoverflow.com/q/7120710/995714) – phuclv May 14 '21 at 04:14

Yury Schkatula · Answer 2 · 2019-05-30T10:55:54.557

Well, you've run into the Wild West of floating point! Do not trust anybody, do not expect much, and keep your hand on your gun.

The thing is: a floating-point representation is a split. A fixed number of bytes is spent on storing two pieces, the mantissa and the exponent (a power of two in the usual binary formats; this is a simplified description for sure, but it suffices for our needs here). Once you have a value with too many digits to fit into the mantissa, what should the computer do? It either has to carry the rest over into additional bytes (as big-math libraries do) or just round to the closest representable value. Let me show:

d2 =                      0.5700788999; // shows                      0.5700788999000001
d2 = 1300010000000000000000.5700788999; // shows 1300009999999999934464.0000000000000000000

Hey, where is my fractional part in the 2nd case? It's gone! Call the police! Oh, wait, it just doesn't fit... This is why diff gives zero: the values are so large that the tail part (where the actual difference is) cannot be stored in the mantissa. And since the remaining digits are the same, the diff is zero.

After careful comparison, you can spot another thing: the printed value is close to the assigned one, but slightly different. This is because a mantissa is just a sum of powers of 2, so to represent the value the computer has to round the assigned value to the closest binary-representable one. This is another source of pain: you should not compare floating-point numbers with the equality operator. Instead, compute the difference and compare it against a tolerance that matches the precision you intend.

I got most of the points. The answer seems to be to use a BigMath library. A big plus for keeping the text relaxed and funny. Thanks. — Emil Mocan, May 30 '19 at 09:59
