3

After reading this question and this MSDN blog, I have tried a few examples to test this:

Console.WriteLine(0.8-0.7 == 0.1);

And yes, the expected output is False. Hence I tried casting the expression on both sides to double and to float to see whether I could get a different result:

Console.WriteLine((float)(0.8-0.7) == (float)(0.1));
Console.WriteLine((double)(0.8-0.7) == (double)(0.1));

The first line outputs True but the second line outputs False. Why is this happening?

Furthermore,

Console.WriteLine(8-0.7 == 7.3);
Console.WriteLine(8.0-0.7 == 7.3);

Both of the lines above give True even without casting. And...

Console.WriteLine(18.01-0.7 == 17.31);

This line outputs False. How is subtracting from 8 different from subtracting from 18.01, when the same floating point number is subtracted from both?

I've tried to read through the blog and the question, but I can't seem to find an answer elsewhere. Can someone please explain, in layman's terms, why all of these happen? Thank you in advance.

EDIT:

Console.WriteLine(8.001-0.001 == 8); // this returns false
Console.WriteLine(8.01-0.01 == 8); // this returns true

Note: I am using the .NET Fiddle online C# compiler.

LEE Hau Chon
  • 435
  • 3
  • 12
  • 2
    [SharpLab](https://sharplab.io/#v2:CYLg1APgAgTAjAWAFBQMwAJboMLoN7LpGYZQAs6AsgBQCU+hxTUcAnNdQGYA2A9gIYAXWtQAMAOgAcAWgkB2egF5F6LnyEiJcWrQDcyAJAGW7asF4BXAEbcAppqmzxC9MtXnrdh9r2MiAX2R/IA=) is interesting here. It seems to be a compiler optimisation. – ProgrammingLlama Jun 25 '19 at 08:38
  • 3
    The main problem with all of these questions is that they are in fact just variations of the common closing question we have here, [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken), and the underlying topic and *full* explanation is *quite complex*. In order to cover everything you'd need to cover compiler optimizations, processor architecture, differences between in-cpu registers vs. memory storage formats, etc. Additionally, much of this is also implementation details which means *it could change*. Can you narrow down your question? – Lasse V. Karlsen Jun 25 '19 at 08:48
  • 2
    Thinking in decimal terms about binary floating point formats is... time lost. Probably the easiest thing to get one's head around is that the decimal number `0.1` (or `0.01`) has no possible representation in IEEE floating point numbers, no matter how many bits you throw at the representation, just as the fraction `1/3` has no possible representation in decimal number systems, no matter how many numbers you throw at the problem. – spender Jun 25 '19 at 08:48
  • 1
    The bottom line is that floating point types in modern programming languages is *not intended* to hold an accurate representation of a number, but something *close enough*. Yes, a lot of numbers *do* have an accurate representation in these types but that is a bonus. – Lasse V. Karlsen Jun 25 '19 at 08:52
  • It's also worth noting (as observed by @John) that, as you are using constant values, you're looking at compiler behaviour, not run-time behaviour. – spender Jun 25 '19 at 08:55
  • Let me give you *my guess* why `8-0.7 == 7.3` is true. My *guess* is that the inaccurate representation of `0.7`, when subtracted from `8`, returns a value that is identical to the inaccurate-but-close-enough representation of `7.3`, but that this doesn't hold true for `8.01 - 7 = 7.31`, however when trying that last part I actually get `True` so I cannot replicate your issue with that. – Lasse V. Karlsen Jun 25 '19 at 08:55
  • @LasseVågsætherKarlsen Sorry for that, `8.01-7 == 7.31` did return true. I mistyped the question. `18.01-7==17.31` is the one which returned `False`. I've edited it and I think your guess might be correct. – LEE Hau Chon Jun 25 '19 at 09:11
  • 1
    To see what is going on, do a subtraction which should be 0.0. Then you see that there is some leftover on the order of E-7~E-10. This is due to the inability of the computer implementation to accurately depict a decimal number. The difference between a float (32 byte) and a double (64 byte) is the total number of bytes the computer is using to represent this number. The double uses more decimals for the approximation and thus has a higher chance to get an inaccurate representation. You can work around this using: math.abs((0.8D-0.7D)-0.1D)<0.0000001D (some arbitrarily small number.) – Mischa Vreeburg Jun 25 '19 at 09:22
  • @MischaVreeburg it's 32 bits and 64 bits. You are correct though; we have to remember computers only understand 1s and 0s. To really answer this question we have to go into binary arithmetic. Since floats are 32 bits and doubles are 64 bits, their number values are actually represented by a different number of bits. A 32-bit representation of the number "4" is different to that of a 64-bit representation. This gets even more complicated when you get 2's complement involved for negative numbers. Have a read of binary mathematics to understand this problem further. – Mahan.A Jun 25 '19 at 10:56
  • @LasseVågsætherKarlsen: Explaining what is happening here is not so complex and does not require covering compiler optimizations, processor architecture, differences between CPU registers versus memory formats, et cetera. The only two issues are that C# specifies that IEEE-754 formats and arithmetic are used except that extra precision **may** be used. So the only variation is that extra precision may be used, and that can be discussed without being overly detailed about CPU registers and whatnot. Further, it has no effect in this question as these specific cases conform to straight IEEE-754. – Eric Postpischil Jun 25 '19 at 12:27
  • 1
    @LasseVågsætherKarlsen: Re “The bottom line is that floating point types in modern programming languages is not intended to hold an accurate representation of a number, but something close enough.” That is a false statement of floating-point arithmetic. The IEEE-754 standard is quite clear about it: A floating-point datum represents a specific number **exactly**; it is not something “close.” The floating-point **operations** approximate real arithmetic. The **numbers** are exact, the **operations** are approximate. This distinction is critical for analysis, design, and writing proofs. – Eric Postpischil Jun 25 '19 at 12:29
  • My bad, I did not clearly write what I meant to write. What I meant was that they are not meant to accurately represent the exact same number as we humans would write, so if you want to represent 0.17, that floating point types are not meant to be able to necessarily represent that number accurately but something close to 0.17. I agree fully, and I didn't write this clear, that a floating point number is exact, it's just very often not the exact value you wanted it to be. – Lasse V. Karlsen Jun 25 '19 at 17:41

1 Answer

5

The Cases of 0.8−0.7

In 0.8-0.7 == 0.1, none of the literals are exactly representable in double. The nearest representable values are 0.8000000000000000444089209850062616169452667236328125 for .8, 0.6999999999999999555910790149937383830547332763671875 for .7, and 0.1000000000000000055511151231257827021181583404541015625 for .1. When the first two are subtracted, the result is 0.100000000000000088817841970012523233890533447265625. As this is not equal to the third, 0.8-0.7 == 0.1 evaluates to false.
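These exact values can be inspected directly. The answer's examples are C#, but since a C# double and a Python float are both IEEE-754 binary64, the same values appear in Python, where `decimal.Decimal` converts a float to its exact decimal expansion (a verification sketch, not part of the original C# code):

```python
from decimal import Decimal

print(Decimal(0.8))        # exact value of the double nearest to 0.8
print(Decimal(0.7))        # exact value of the double nearest to 0.7
print(Decimal(0.1))        # exact value of the double nearest to 0.1
print(Decimal(0.8 - 0.7))  # exact value of the double subtraction's result

# The subtraction result is not the double nearest to 0.1:
print(0.8 - 0.7 == 0.1)    # False
```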

In (float)(0.8-0.7) == (float)(0.1), the result of 0.8-0.7 and 0.1 are each converted to float. The float value nearest to the former, 0.1000000000000000055511151231257827021181583404541015625, is 0.100000001490116119384765625. The float value nearest to the latter, 0.100000000000000088817841970012523233890533447265625, is 0.100000001490116119384765625. Since these are the same, (float)(0.8-0.7) == (float)(0.1) evaluates to true.
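The rounding performed by the `(float)` cast can be reproduced by round-tripping a double through a 32-bit float; in Python this can be done with the standard `struct` module (a sketch; `numpy.float32` would work equally well):

```python
import struct
from decimal import Decimal

def to_float32(x):
    """Round a double to the nearest IEEE-754 binary32 value,
    returned widened back to a double (like C#'s (float) cast)."""
    return struct.unpack('f', struct.pack('f', x))[0]

print(Decimal(to_float32(0.8 - 0.7)))  # 0.100000001490116119384765625
print(Decimal(to_float32(0.1)))        # 0.100000001490116119384765625

# Both round to the same float, so the comparison is true:
print(to_float32(0.8 - 0.7) == to_float32(0.1))  # True
```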

In (double)(0.8-0.7) == (double)(0.1), the result of 0.8-0.7 and 0.1 are each converted to double. Since they are already double, there is no effect, and the result is the same as for 0.8-0.7 == 0.1.

Notes

The C# specification, version 5.0, indicates that float and double are the IEEE-754 32-bit and 64-bit floating-point types. I do not see it explicitly state that they are the binary floating-point formats rather than decimal formats, but the characteristics described make this evident. The specification also states that IEEE-754 arithmetic is generally used, with round-to-nearest (presumably round-to-nearest-ties-to-even), subject to the exception below.

The C# specification allows floating-point arithmetic to be performed with more precision than the nominal type. Clause 4.1.6 says “… Floating-point operations may be performed with higher precision than the result type of the operation…” This can complicate analysis of floating-point expressions in general, but it does not concern us in the instance of 0.8-0.7 == 0.1 because the only applicable operation is the subtraction of 0.7 from 0.8, and these numbers are in the same binade (have the same power of two in the floating-point representation), so the result of the subtraction is exactly representable and additional precision will not change the result. As long as the conversion of the source texts 0.8, 0.7, and 0.1 to double does not use extra precision and the cast to float produces a float with no extra precision, the results will be as stated above. (The C# standard says in clause 6.2.1 that a conversion from double to float yields a float value, although it does not explicitly state that no extra precision may be used at this point.)
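The claim that the subtraction itself is exact can be checked by comparing the floating-point result against exact decimal arithmetic on the operands (again a Python sketch, relying on Python's float being the same binary64 format; the decimal context precision is raised so the Decimal subtraction itself does not round):

```python
import decimal
from decimal import Decimal

decimal.getcontext().prec = 60  # enough digits for an exact result here

# Exact difference of the two doubles nearest 0.8 and 0.7 ...
exact = Decimal(0.8) - Decimal(0.7)

# ... equals the result of the double subtraction: no rounding occurred,
# so extra intermediate precision could not change anything.
print(exact == Decimal(0.8 - 0.7))  # True

# But it is not the double nearest 0.1:
print(exact == Decimal(0.1))        # False
```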

Additional Cases

In 8-0.7 == 7.3, we have 8 for 8, 7.29999999999999982236431605997495353221893310546875 for 7.3, 0.6999999999999999555910790149937383830547332763671875 for 0.7, and 7.29999999999999982236431605997495353221893310546875 for 8-0.7, so the result is true.

Note that the additional precision allowed by the C# specification could affect the result of 8-0.7. A C# implementation that used extra precision for this operation could produce false for this case, as it would get a different result for 8-0.7.

In 18.01-0.7 == 17.31, we have 18.010000000000001563194018672220408916473388671875 for 18.01, 0.6999999999999999555910790149937383830547332763671875 for 0.7, 17.309999999999998721023075631819665431976318359375 for 17.31, and 17.31000000000000227373675443232059478759765625 for 18.01-0.7, so the result is false.
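Both additional cases can be confirmed the same way (Python again, same binary64 doubles; a verification sketch):

```python
from decimal import Decimal

# 8 - 0.7 rounds to exactly the double nearest 7.3:
print(8 - 0.7 == 7.3)            # True
print(Decimal(8 - 0.7) == Decimal(7.3))    # True: identical doubles

# 18.01 - 0.7 rounds to a nearby but different double than 17.31:
print(18.01 - 0.7 == 17.31)      # False
print(Decimal(18.01 - 0.7))      # exact value of the subtraction result
print(Decimal(17.31))            # exact value of the double nearest 17.31
```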

> How is subtracting from 8 different from subtracting from 18.01, when the same floating point number is subtracted from both?

18.01 is larger than 8 and requires a greater power of two in its floating-point representation. Similarly, the result of 18.01-0.7 is larger than that of 8-0.7. This means the bits in their significands (the fraction portion of the floating-point representation, which is scaled by the power of two) represent greater values, causing the rounding errors in the floating-point operations to be generally greater. In general, a floating-point format has a fixed span—there is a fixed distance from the high bit retained to the low bit retained. When you change to numbers with more bits on the left (high bits), some bits on the right (low bits) are pushed out, and the results change.
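This effect of magnitude on spacing can be seen directly with `math.ulp` (available in Python 3.9+), which gives the distance from a double to the next representable double (a sketch):

```python
import math

# The gap between adjacent doubles grows with magnitude:
print(math.ulp(7.3))    # spacing near 7.3   (7.3 lies in [4, 8))
print(math.ulp(17.31))  # spacing near 17.31 (17.31 lies in [16, 32))

# 17.31 sits two powers of two higher, so its ulp is exactly 4x larger,
# and rounding errors near 17.31 are correspondingly larger:
print(math.ulp(17.31) == 4 * math.ulp(7.3))  # True
```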

Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312