
A float variable has 23 mantissa bits in total, while a double variable has 53.

This means that the digits that can be represented precisely by

a float variable are log10(2^23) = 6.92368990027 = 6 in total

and by a double variable log10(2^53) = 15.9545897702 = 15

Let's look at this code:

float b = 1.12391;
double d = b;
cout<<setprecision(15)<<d<<endl;

It prints

1.12390995025635

however this code:

double b = 1.12391;
double d = b;
cout<<setprecision(15)<<d<<endl;

prints 1.12391

Can someone explain why I get different results? I converted a float variable of 6 digits to double, the compiler must know that these 6 digits are important. Why? Because I'm not using more digits that can't all be represented correctly in a float variable. So instead of printing the correct value it decides to print something else.

Pascal Cuoq
ksm001
  • http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html – AK_ Jun 22 '14 at 21:42
  • If you count double's significand as 53 bits then float is 24. Or double 52 and float 23 if you are talking about explicit significand bits. – Pascal Cuoq Jun 22 '14 at 21:44
  • @AK_ The document you link to does not contain the answer to the question. Did you even read it? – Pascal Cuoq Jun 22 '14 at 21:48
  • Don't understand why this question was downvoted. – adjan Jun 22 '14 at 21:49
  • @AK_ You are mistaken. The IEEE 754 specification knows. – user207421 Jun 22 '14 at 21:51
  • @PascalCuoq Of course I did, and it sort of does... But it doesn't matter, I wasn't answering the question; it's useful background knowledge. – AK_ Jun 22 '14 at 21:59
  • @EJP It really doesn't; it very much depends on the platform, compiler, and optimizations :-( – AK_ Jun 22 '14 at 22:01
  • @AK_ The IEEE 754 specification gives the rules for this conversion. If you wish to claim there are applicable compiler optimizations or platform dependencies, please provide some evidence. – user207421 Jun 22 '14 at 22:08
  • @AK_ No, neither of your two comments can be described as relevant to the question. It is perfectly possible to predict what each snippet in the question does, as long as you are not compiling for a Cray computer from 1979. And Goldberg's report is not the best source of information to recommend for that. For one thing, it only describes floating-point in abstracto, and not how it ties in with C++ for modern, widespread compilers (I am not saying that C++ mandates IEEE 754, only that it is a rare compiler that doesn't implement it). – Pascal Cuoq Jun 22 '14 at 22:10

3 Answers

4

Converting from float to double preserves the value. For this reason, in the first snippet, d contains exactly the approximation to float precision of 112391/100000. The rational 112391/100000 is stored in the float format as 9428040 / 2^23. If you carry out this division, the result is exactly 1.12390995025634765625: the float approximation is not very close. cout << prints the representation to 14 digits after the decimal point. The first omitted digit is 7, so the last printed digit, 4, is rounded up to 5.

In the second snippet, d contains the approximation to double precision of the value of 112391/100000, 1.123909999999999964614971759147010743618011474609375 (in other words 5061640657197974 / 2^52). This approximation is much closer to the rational. If it were printed with 14 digits after the decimal point, the last digits would all be zeroes (after rounding, because the first omitted digit would be 9). cout << does not print trailing zeroes, so you see 1.12391 as output.

“Because I'm not using more digits that can't all be represented correctly in a float variable”

When you apply log10 to 2^23 (incorrectly: it should be 2^24), you get roughly the number of decimal digits that can be stored in a float. Because float's representation is not decimal, the digits after these seven or so are not zeroes in general. They are the digits that happen to be there in decimal for the closest binary representation that the compiler chose for the number you wrote.

Pascal Cuoq
1

float b = 1.12391;

The problem is here, and here:

double b = 1.12391;

These assignments are already imprecise. Calculations or casts using them will therefore also be imprecise.

user207421
0

You're mistaken in assuming that the first 6 digits will be precisely the same. When we say that float is precise to within 6 (decimal) digits, we mean that the relative difference between the actual and intended value is less than 10^-6. So, 1.12390995 and 1.12391 differ by 0.00000005. That's much better than the 10^-6 you can rely on.

MSalters
  • Actually the “log10(2^23)” computation in the question is wrong. But regardless, the whole point of applying `log10` is to obtain a number of decimal digits (and respectable people have done it, so it can't be that bad an idea). If you are going to use the number n you get (7 if you do it correctly) in the sentence “the relative difference is less than 10^-n”, you had better keep the 2^-24 you had in the first place. – Pascal Cuoq Jun 22 '14 at 22:33