
I am unable to understand why C++ division behaves the way it does. I have a simple program which divides 1 by 10 (using VS 2003):

    double dResult = 0.0;
    dResult = 1.0/10.0;

I expect dResult to be 0.1; however, I get 0.10000000000000001.

  1. Why do I get this value? What is the problem with the internal representation of double/float?
  2. How can I get the correct value?

Thanks.

Waseem
  • C and C++ use [IEEE-754](http://en.wikipedia.org/wiki/IEEE_754-2008), and using binary to represent base-10 floating point numbers can lead to inaccuracy like you're seeing. `0.1` is actually not representable in IEEE-754. – wkl Dec 28 '11 at 14:12
  • @birryree Are you sure? I thought that both standards left that implementation defined. Now obviously in practice every CPU uses IEEE-754 (more or less at least) so it doesn't matter, but still.. – Voo Dec 28 '11 at 15:18
  • A `double` has 64 bits, so there are at most 2^64 distinct numbers that it can represent. `0.1` is not one of them. It's as simple as that. – fredoverflow Dec 28 '11 at 15:25
  • @Voo - looking at the standard, looks like you're right - C99 Annex F does mention the IEEE-754 support, but C++03 specifies that there can be specializations that don't conform to IEEE-754/IEC-559. Not sure about C++11. – wkl Dec 28 '11 at 16:49
  • Link to [What Every Computer Scientist Should Know About Floating-Point Arithmetic](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html). There is already a link to this in the [C++ tag](http://stackoverflow.com/tags/c%2b%2b/info). – Martin York Dec 28 '11 at 17:15
  • possible duplicate of [floating point issue](http://stackoverflow.com/questions/3733071/floating-point-issue) or another 100 similar examples. – Martin York Dec 28 '11 at 17:18
  • Does this answer your question? [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken) – eesiraed Apr 23 '20 at 04:41

4 Answers


Because almost all modern processors use binary floating-point, which cannot exactly represent 0.1 (there is no way to represent 0.1 as m * 2^e with integer m and e).

If you want to see the "correct value", you can print it out with e.g.:

    printf("%.1f\n", dResult);
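
For illustration, here is a minimal self-contained sketch (an editorial addition, assuming IEEE-754 doubles) contrasting the rounded display with the full stored value:

    #include <cstdio>

    int main() {
        double dResult = 1.0 / 10.0;
        // Rounded to one decimal place, the output looks as expected:
        std::printf("%.1f\n", dResult);   // prints 0.1
        // Asking for 17 significant digits reveals the nearest
        // representable double instead:
        std::printf("%.17g\n", dResult);  // prints 0.10000000000000001
        return 0;
    }
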
Oliver Charlesworth
  • Just a nit, but not all modern processors use binary floating-point. IBM mainframes still use base 16, and Unisys mainframes base 8. (Neither of which can represent `0.1` exactly either, of course.) – James Kanze Dec 28 '11 at 14:26
  • @James: I was aware that IBM used to use base-16, but are they still releasing processors based on that? – Oliver Charlesworth Dec 28 '11 at 14:46
  • @Oli I thought all IBM processors implemented both FP variants in the last few years, so presumably you can tell the compiler which variant to use? – Voo Dec 28 '11 at 15:40
  • @OliCharlesworth Very definitely. Their System Z. Current models support both IEEE and their native format, but the last time I checked (admittedly some years ago), the native format was about twice the speed of the IEEE; the IEEE was mainly there to support Java. – James Kanze Dec 28 '11 at 18:50
  • Thanks Oli. That's the way to 'see' the correct value, though I want to 'use' or 'get' the correct value as I will be using it for further calculation. Is there any way to get the error value, so that I can subtract it from the final answer later? – Waseem Dec 29 '11 at 05:06

Double and float are not identical to real numbers. This is because there are infinitely many real numbers, but only a finite number of bits to represent them in a double/float.

For further reading: [What Every Computer Scientist Should Know About Floating-Point Arithmetic](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)
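
A small illustrative sketch (an editorial addition, assuming IEEE-754 doubles): since 0.1, 0.2 and 0.3 are all rounded to the nearest representable double, the "obvious" equality below fails.

    #include <iostream>
    #include <iomanip>

    int main() {
        // None of 0.1, 0.2, 0.3 is exactly representable in binary,
        // so the rounded sum does not equal the rounded literal 0.3.
        std::cout << std::boolalpha << (0.1 + 0.2 == 0.3) << '\n';   // false
        std::cout << std::setprecision(17) << (0.1 + 0.2) << '\n';   // 0.30000000000000004
        return 0;
    }
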

amit
  • This article regularly gets linked in response to this kind of question, but it's really not aimed at beginners... – Oliver Charlesworth Dec 28 '11 at 14:14
  • @OliCharlesworth Which is a good reason why beginners shouldn't use floating point. It doesn't work the way they expect. (But then, one could almost say the same thing about C++ in general. Or any other programming language, for that matter.) – James Kanze Dec 28 '11 at 14:27
  • He still has a rational number, not an irrational one. – Luka Rahne Dec 28 '11 at 14:36
  • @ralu: there is an infinite number of those as well, even in the range [0,1] (and in any range which is not a singleton). – amit Dec 28 '11 at 14:37
  • 1
    @James That's a bit harsh imo. You can explain the limitations of fp math in such a way that even people with no math background can easily understand the problems. And if they understand the simple, basic principle (You can't represent all numbers exactly) they can use it just as fine as anyone else. Provocative: Knowing **why** I can't use FP to store money isn't that much more useful than knowing I **shouldn't** use FP to store money ;) – Voo Dec 28 '11 at 15:25
  • @Voo Unless they can understand the implications in the cited work, they don't know enough to be able to distinguish when they can safely use machine floating point, and when they can't. – James Kanze Dec 28 '11 at 18:52
  • @James Is that so? In which cases does "If you want to rely on the number being represented "exactly", don't use floating points but a library, if you can live with some precision loss and performance is important use floats" fail? Seems pretty easy to me. – Voo Dec 28 '11 at 21:19
  • @Voo It may seem pretty easy, but your specification is completely wrong, and will lead to significant errors in many cases. First, of course, there's no such thing as representing a real number exactly, regardless of the library. And "some precision loss" is meaningless. How much depends on the actual operations (and it's easy to end up with results which are off by more than a magnitude). – James Kanze Dec 29 '11 at 08:33
  • @James Lucky me then that I put exactly under quotes - I wonder why (although I'm interested why you think we can't represent rationals exactly). Is it a correct mathematical definition? I hope not, otherwise it'd completely miss its goal, but: Tell me one situation where this rule of thumb doesn't work. The only thing that can happen is that the person uses a library in a situation where fp would be fine as well, not much of a problem that. – Voo Dec 29 '11 at 15:05
  • @Voo You can represent many rationals exactly. But not all, and not all real numbers are rationals. As for the rule of thumb, I'd be more interested in hearing about a case where it does work. Perhaps if all values have same magnitude, and there are only a couple of operations on each value? (In general, for example, addition isn't associative in machine floating point, and not being aware of this, and how to cope with it, can cause enormous loss of precision in very simple expressions.) – James Kanze Dec 29 '11 at 17:00
  • @James Oh I do know that there are irrationals, but you said there's no such thing as representing a real (which includes all rationals) exactly which just isn't true (also we can represent ALL rationals exactly not only "many"). And when it does work? Let's see: Money? Well I sure don't want 1.10 turning out to be 1.09998 there - so use a library. GPS coordinate calculations for general usage? No problem there, let's use floats. And yes I do know Kahan and all the stuff, but it's of use only for a few areas and there you're really better off using existing libraries anyhow (faster and better) – Voo Dec 29 '11 at 17:19

The ubiquitous IEEE 754 floating-point format expresses floating-point numbers in scientific notation base 2, with a finite mantissa. Since a fraction like 1/5 (and hence 1/10) does not have a representation with finitely many digits in binary scientific notation, you cannot represent the value 0.1 exactly. More generally, the only values that can be represented exactly are those that fit precisely into binary scientific notation with a mantissa of a few (e.g. 24, 53 or 64) binary digits and a suitably small exponent.
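
To make that concrete, a small sketch (an editorial addition, assuming 64-bit IEEE-754 doubles): the value actually stored for the literal 0.1 is the dyadic fraction 3602879701896397 / 2^55, the closest number of the form m * 2^e with a 53-bit significand.

    #include <cstdio>

    int main() {
        // 2^55 = 36028797018963968; both operands below are exact doubles,
        // and dividing by a power of two is exact, so 'nearest' is computed
        // without any rounding error.
        double nearest = 3602879701896397.0 / 36028797018963968.0;
        std::printf("%d\n", nearest == 0.1);   // prints 1: the same stored value
        std::printf("%.17g\n", nearest);       // prints 0.10000000000000001
        return 0;
    }
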

Kerrek SB

Working with integers, floats, and doubles can be tricky. It depends on what your purpose is. If you only want to display the values in a nice format, you can play with the C++ I/O manipulators `precision`, `showpoint`, and `noshowpoint`. If you are trying to do precise computation with numeric methods, you may have to use a library for accurate representation. If you are multiplying lots of small and large numbers, you may have to resort to log transformations. Here is a small test:

    #include <iostream>
    #include <iomanip>
    using namespace std;

    int main() {
        float x = 1.0000001f;
        cout << x << endl;
        float y = 9.9999999999999f;
        cout << "using default io format " << y/x << endl;
        cout << showpoint << "using showpoint " << y/x << endl;
        y = 9.9999f;
        cout << "fewer 9 default C++ " << y/x << endl;
        cout << showpoint << "fewer 9 showpoint" << y/x << endl;
        return 0;
    }

Output:

    1
    using default io format 10
    using showpoint 10.0000
    fewer 9 default C++ 9.99990
    fewer 9 showpoint9.99990
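
As a side note on the log-transformation idea mentioned above, a small sketch (an editorial addition): multiplying many tiny factors underflows a double to zero, while summing their logarithms keeps the magnitude representable.

    #include <cmath>
    #include <cstdio>

    int main() {
        double prod = 1.0;
        double logSum = 0.0;
        for (int i = 0; i < 500; ++i) {
            prod *= 1e-3;                  // underflows to 0 after ~100 steps
            logSum += std::log(1e-3);      // stays a perfectly ordinary number
        }
        std::printf("direct product: %g\n", prod);     // 0
        std::printf("log of product: %g\n", logSum);   // about -3453.88
        return 0;
    }
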

In special cases where you want to use a double (which may be the result of some complicated algorithm) to represent integer numbers, you have to figure out the proper conversion method. Once I had a situation where I wanted to use a single double value to store two kinds of values, -1, +1, or a fraction in (0, 1), to make my code more memory efficient (and faster; heavy memory use tends to reduce performance). It is a little tricky to distinguish between +1 and val < 1. In this case I knew that the values < 1 had a resolution of, say, only 1/500, so I could safely use floor(val + 0.000001) to get back the 1 value that I initially stored.
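
A sketch of that trick (an editorial addition; the `decode` helper and the 1/500 resolution are just the assumptions stated above): because a genuine fraction can be at most 499/500, adding a tiny epsilon before flooring cannot turn it into 1, while a stored +1 is recovered reliably.

    #include <cmath>
    #include <cstdio>

    // Hypothetical decoder for a double that packs three kinds of values:
    // the marker -1, the marker +1, or a fraction in (0, 1) with a
    // resolution of 1/500 (the assumption stated above).
    double decode(double val) {
        if (val < 0.0) return -1.0;                          // the -1 marker
        if (std::floor(val + 0.000001) >= 1.0) return 1.0;   // the +1 marker
        return val;                                          // ordinary fraction
    }

    int main() {
        // prints: -1 1 0.998
        std::printf("%g %g %g\n", decode(-1.0), decode(1.0), decode(0.998));
        return 0;
    }
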

Kemin Zhou