C floating point precision

Question

Possible Duplicate:
Floating point comparison

I have a problem with the accuracy of float in C/C++. When I execute the program below:

#include <stdio.h>

int main (void) {
    float a = 101.1;
    double b = 101.1;
    printf ("a: %f\n", a);
    printf ("b: %lf\n", b);
    return 0;
}

Result:

a: 101.099998
b: 101.100000

I believe a float has 32 bits, which should be enough to store 101.1. Why is the value not stored exactly?

score 17 · Accepted Answer · edited May 23 '17 at 10:28

You can only represent numbers exactly in IEEE754 (at least for the single and double precision binary formats) if they can be constructed from adding together inverted powers of two (i.e., 2^-n like 1, 1/2, 1/4, 1/65536 and so on) subject to the number of bits available for precision.

There is no combination of inverted powers of two that will get you exactly to 101.1, within the scaling provided by floats (23 bits of precision) or doubles (52 bits of precision).

If you want a quick tutorial on how this inverted-power-of-two stuff works, see this answer.

Applying the knowledge from that answer to your 101.1 number (as a single precision float):

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
0 10000101 10010100011001100110011
           |  | |   ||  ||  ||  |+- 8388608
           |  | |   ||  ||  ||  +-- 4194304
           |  | |   ||  ||  |+-----  524288
           |  | |   ||  ||  +------  262144
           |  | |   ||  |+---------   32768
           |  | |   ||  +----------   16384
           |  | |   |+-------------    2048
           |  | |   +--------------    1024
           |  | +------------------      64
           |  +--------------------      16
           +-----------------------       2

The mantissa part of that actually continues forever for 101.1:

mmmmmmmmm mmmm mmmm mmmm mm
100101000 1100 1100 1100 11|00 1100 (and so on).

Hence it's not a matter of precision: no finite number of bits will represent that number exactly in an IEEE754 binary format.

Using the bits to calculate the actual number (closest approximation), the sign is positive. The exponent field is 128+4+1 = 133; subtracting the bias of 127 gives 6, so the multiplier is 2⁶ or 64.

The mantissa consists of 1 (the implicit base) plus the value of each set bit, where the nth bit from the left is worth 1/(2ⁿ) with n starting at 1: {1/2, 1/16, 1/64, 1/1024, 1/2048, 1/16384, 1/32768, 1/262144, 1/524288, 1/4194304, 1/8388608}.

When you add all these up, you get 1.57968747615814208984375.

When you multiply that by the multiplier previously calculated, 64, you get 101.09999847412109375.

All numbers were calculated with bc using a scale of 100 decimal digits, resulting in a lot of trailing zeros, so the numbers should be very accurate. Doubly so, since I checked the result with:

#include <stdio.h>
int main (void) {
    float f = 101.1f;
    printf ("%.50f\n", f);
    return 0;
}

which also gave me 101.09999847412109375000....

"... only represent numbers exactly in [IEEE754](http://en.wikipedia.org/wiki/IEEE754) if they can be constructed from adding ... inverted powers of two ... " seems incomplete in that IEEE754 also defines floating point numbers with inverted powers of ten. Certainly IEEE754 binary formats are more common though. — chux - Reinstate Monica, Aug 06 '14 at 18:45
@chux, that's a very valid point, adjusted the answer to make that clear. — paxdiablo, Aug 06 '14 at 20:21

score 4 · Answer 2 · answered Sep 28 '12 at 07:33

You need to read more about how floating-point numbers work, especially the part on representable numbers.

You're not giving much of an explanation as to why you think that "32 bits should be enough for 101.1", so it's kind of hard to refute.

Binary floating-point numbers don't work well for all decimal numbers, since they basically store the number in, wait for it, base 2. As in binary.

This is a well-known fact, and it's the reason why e.g. money should never be handled in floating-point.

unwind


can you illustrate by example how 101.1 is stored in computer? – Jeremy Sep 28 '12 at 07:35
101.1 can certainly be represented in 32 bits. Just not with any of the usual floating point formats supported by hardware. – James Kanze Sep 28 '12 at 07:57
@Jeremy It depends on the system. I'd recommend the Wikipedia article "floating point" for starters, although it doesn't give you enough information to actually start using them. The article [What Every Computer Scientist Should Know About Floating-Point Arithmetic](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) is about the best introduction I know. – James Kanze Sep 28 '12 at 08:01
For example, it can be represented in fixed point `999V9` BCD format in 16 bits as `0001 0000 0001 0001`. – paxdiablo Sep 28 '12 at 08:04

score 4 · Answer 3 · answered Sep 28 '12 at 07:39

Your number 101.1 in base 10 is 1100101.0(0011) in base 2. The 0011 part repeats forever. Thus, no matter how many binary digits you have, the number cannot be represented exactly in the computer.

Looking at the IEEE754 standard for floating point, you can find out why the double version seemed to show the value exactly.

PS: Derivation of 101.1 in base 10 is 1100101.0(0011) in base 2:

101 = 64 + 32 + 4 + 1
101 -> 1100101

.1 * 2 =  .2 -> 0
.2 * 2 =  .4 -> 0
.4 * 2 =  .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 =  .4 -> 0
.4 * 2 =  .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 =  .4 -> 0
.4 * 2 =  .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 =  .4 -> 0
.4 * 2 =  .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 =  .4 -> 0
.4 * 2 =  .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2....

PPS: It's the same as trying to store the exact result of 1/3 in base 10.

ouah · Answer 4 · 2012-09-28T07:44:07.243

If you print more digits of the double, you'll see that even a double cannot represent the value exactly:

 printf ("b: %.16f\n", b);

 b: 101.0999999999999943

The thing is that float and double use a binary format, and not all decimal numbers can be represented exactly in a binary format.

edited Sep 28 '12 at 07:44

answered Sep 28 '12 at 07:34


score 2 · Answer 5 · answered Sep 28 '12 at 07:40

What you see here is the combination of two factors:

IEEE754 binary floating point cannot exactly represent a whole class of rational numbers, nor any irrational number.
The effects of rounding in printf (by default to 6 decimal places here). That is to say, the error when using a double occurs somewhere to the right of the 6th decimal place.

SingerOfTheFall · Answer 6 · 2012-09-28T08:00:11.130

Unfortunately, most decimal floating point numbers cannot be accurately represented in (machine) floating point. This is just how things work.

For instance, the number 101.1 is represented in binary as 1100101.0(0011) (the 0011 part repeats forever), so no matter how many bytes you use to store it, it will never be exact. Here is a little article about binary representation of floating point, and here you can find some examples of converting floating point numbers to binary.

If you want to learn more on this subject, I can recommend this article, though it's long and not too easy to read.

edited Sep 28 '12 at 08:00

answered Sep 28 '12 at 07:37


More a question of vocabulary, but I'd say "most real numbers cannot be accurately represented in (machine) floating point", or "most decimal floating point numbers cannot be accurately represented in (machine) floating point". (The latter is obviously only true if machine floating point isn't decimal. But while I've used machines with decimal floating point in the past, I think today only bases 2, 8 and 16 are still around.) – James Kanze Sep 28 '12 at 07:56
