How does casting from float to double work in C++?

Question

The mantissa bits in a float variable are 23 in total while in a double variable 53.

This means that the digits that can be represented precisely by

a float variable are log10(2^23) = 6.92368990027 = 6 in total

and by a double variable log10(2^53) = 15.9545897702 = 15

Let's look at this code:

float b = 1.12391;
double d = b;
cout<<setprecision(15)<<d<<endl;

It prints

1.12390995025635

however this code:

double b = 1.12391;
double d = b;
cout<<setprecision(15)<<d<<endl;

prints 1.12391

Can someone explain why I get different results? I converted a float variable of 6 digits to double, the compiler must know that these 6 digits are important. Why? Because I'm not using more digits that can't all be represented correctly in a float variable. So instead of printing the correct value it decides to print something else.

http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html — AK_, Jun 22 '14 at 21:42
If you count double's significand as 53 bits then float is 24. Or double 52 and float 23 if you are talking about explicit significand bits. — Pascal Cuoq, Jun 22 '14 at 21:44
@AK_ The document you link to does not contain the answer to the question. Did you even read it? — Pascal Cuoq, Jun 22 '14 at 21:48
@PascalCuoq Of course I did, and it sort of does... But it doesn't matter; I wasn't answering the question. It's useful background knowledge — AK_, Jun 22 '14 at 21:59
@EJP It really doesn't; it very much depends on the platform, compiler, and optimizations :-( — AK_, Jun 22 '14 at 22:01
@AK_ The IEEE 754 specification gives the rules for this conversion. If you wish to claim there are applicable compiler optimizations or platform dependencies, please provide some evidence. — user207421, Jun 22 '14 at 22:08
@AK_ No, neither of your two comments can be described as relevant to the question. It is perfectly possible to predict what each snippet in the question does, as long as you are not compiling for a Cray computer from 1979. And Goldberg's report is not the best source of information to recommend for that. For one thing, it only describes floating-point in abstracto, and not how it ties in with C++ for modern, widespread compilers (I am not saying that C++ mandates IEEE 754, only that it is a rare compiler that doesn't implement it). — Pascal Cuoq, Jun 22 '14 at 22:10

Pascal Cuoq · Accepted Answer · 2014-06-22T22:56:10.217

Converting from float to double preserves the value. For this reason, in the first snippet, d contains exactly the approximation to float precision of 112391/100000. The rational 112391/100000 is stored in the float format as 9428040 / 2²³. If you carry out this division, the result is exactly 1.12390995025634765625: the float approximation is not very close. With setprecision(15), cout << prints 15 significant digits, which here means 14 digits after the decimal point. The first omitted digit is 7, so the last printed digit, 4, is rounded up to 5.

In the second snippet, d contains the approximation to double precision of the value of 112391/100000, 1.123909999999999964614971759147010743618011474609375 (in other words 5061640657197974 / 2⁵²). This approximation is much closer to the rational. If it were printed with 14 digits after the decimal point, the last digits would all be zeroes (after rounding, because the first omitted digit would be 9). In its default format, cout << does not print trailing zeroes, so you see 1.12391 as output.

“Because I'm not using more digits that can't all be represented correctly in a float variable”

When you apply log10 to 2²⁴ (not 2²³, as the question does), you get roughly the number of decimal digits that can be stored in a float. Because float's representation is not decimal, the digits after these seven or so are not zeroes in general. They are the digits that happen to be there in decimal for the closest binary representation that the compiler chose for the number you wrote.

score 1 · Answer 2 · answered Jun 22 '14 at 21:53

float b = 1.12391;

The problem is here, and here:

double b = 1.12391;

These initializations are already imprecise, because 1.12391 has no exact binary floating-point representation. Calculations or casts that start from them will therefore carry that imprecision too.

user207421


score 0 · Answer 3 · answered Jun 22 '14 at 22:24

You're mistaken in assuming that the first 6 digits will be precisely the same. When we say that float is precise to within 6 (decimal) digits, we mean that the relative difference between the actual and intended value is less than 10^-6. So, 1.12390995 and 1.12391 differ by 0.00000005 (5 × 10^-8). That's much better than the 10^-6 you can rely on.

MSalters


Actually, the “log10(2^23)” computation in the question is wrong. But regardless, the whole point of applying `log10` is to obtain a number of decimal digits (and respectable people have done it, so it can't be that bad an idea). If you are going to use the number n you get (7 if you do it correctly) in the sentence “the relative difference is less than 10^-n”, you had better keep the 2^-24 you had in the first place. – Pascal Cuoq Jun 22 '14 at 22:33
