What happens is
5 [expr]
10 The values of floating operands and of the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.55)
55) The cast and assignment operators must still perform their specific conversions as described in 5.4, 5.2.9 and 5.17.
(C++03; practically identical to 6.3.1.8(2) in C99 and the N1570 draft of C11; I'm confident that the gist is identical in C++11.)
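Whether (and how much) excess precision an implementation actually uses can be queried with FLT_EVAL_METHOD from <cfloat> (C99, picked up by C++11); a minimal sketch:

#include <cfloat>
#include <cstdio>

int main() {
    // FLT_EVAL_METHOD: 0 = evaluate in the operand types,
    // 1 = evaluate float/double operations in double,
    // 2 = evaluate everything in long double (typical for x87),
    // negative = indeterminable / implementation-specific.
    std::printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
}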
In the following, I assume an IEEE-754 like binary floating point representation.
In a fractional hexadecimal notation,
1/10 = 1/2 * 3/15
     = 1/2 * 0.33333333333...
     = 0.19999999999...
     = 2^(-4) * 1.999999999...
so when that is rounded to b bits of precision, you get
2^(-4) * 1.99...9a // if b ≡ 0 (mod 4) or b ≡ 1 (mod 4)
2^(-4) * 1.99...98 // if b ≡ 2 (mod 4) or b ≡ 3 (mod 4)
where the last hex digit in the fractional part is truncated after its 3, 4, 1 or 2 most significant bits, respectively.
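For the two precisions that matter below (b = 24 for float, b = 53 for double), this rounding can be made visible with the hexadecimal output format; a small sketch, with the output I'd expect on an IEEE-754 machine in the comments:

#include <cstdio>

int main() {
    // %a prints the exact value in hexadecimal floating-point notation.
    std::printf("%a\n", 0.1f); // b = 24 (b ≡ 0 (mod 4)): 0x1.99999ap-4
    std::printf("%a\n", 0.1);  // b = 53 (b ≡ 1 (mod 4)): 0x1.999999999999ap-4
}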
Now 320 = 2^6 * (2^2 + 1), so the result of r * 320, where r is 0.1 rounded to b bits, is, in full precision (ignoring the power of 2),
  6.66...68
+ 1.99...9a
-----------
  8.00...02
with b+3 bits for b ≡ 0 (mod 4) or b+2 bits for b ≡ 1 (mod 4), and
  6.66...60
+ 1.99...98
-----------
  7.ff...f8
with b+2 bits for b ≡ 2 (mod 4) or b+1 bits for b ≡ 3 (mod 4).
In each case, rounding the result to b bits of precision yields exactly 32, and then you get 256/32 = 8 as the final result. But if the intermediate result with greater precision is used, the calculated result of 256/(0.1 * 320) is slightly smaller or larger than 8.
With the typical 32-bit float with 24 (23+1) bits of precision, if the intermediate results are represented with a precision of at least 53 bits:
0.1f = 1.99999ap-4
0.1f * 320 = 32*(1 + 2^(-26))
256/(0.1f * 320) = 8/(1 + 2^(-26)) = 8 * (1 - 2^(-26) + 2^(-52) - ...)
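These values can be checked directly, since a double can hold the exact product of the float 0.1 and 320; a sketch assuming IEEE-754 float/double and assignments that actually round to the declared type:

#include <cassert>
#include <cmath>

int main() {
    // The exact product needs only 27 bits, so the double multiplication is exact.
    double exact = static_cast<double>(0.1f) * 320.0;
    assert(exact == 32.0 * (1.0 + std::ldexp(1.0, -26))); // 32*(1 + 2^(-26))
    assert(256.0 / exact < 8.0); // the wider intermediate is slightly below 8

    // Rounded back to 24 bits of precision, the product is exactly 32.
    float rounded = 0.1f * 320.0f;
    assert(rounded == 32.0f);
}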
In case 1, the result is converted¹ to int directly from the intermediate result. Since the intermediate result is slightly smaller than 8, it gets truncated to 7. In case 2, the intermediate result is stored in a float before converting to int, hence it is rounded to 24 bits of precision first, resulting in exactly 8.
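Put as code (a reconstruction of the two cases, assuming the expression in question is essentially 256/(0.1f * 320.0f); which value case 1 produces depends on whether excess precision is used):

#include <cstdio>

int main() {
    // Case 1: the intermediate result is converted to int directly.
    // With excess precision it is slightly below 8 and truncates to 7;
    // without excess precision the product is 32.0f and the result is 8.
    int case1 = 256 / (0.1f * 320.0f);

    // Case 2: storing the product in a float first forces the rounding
    // to 24 bits, so the division is exactly 256/32 = 8.
    float product = 0.1f * 320.0f;
    int case2 = 256 / product;

    std::printf("case 1: %d, case 2: %d\n", case1, case2);
}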
Now if you leave off the f suffix, 0.1 is a double (presumably with 53 bits of precision), the two floats are promoted to double for the calculation, and
0.1 = 1.999999999999ap-4
0.1 * 320 = 32*(1 + 2^(-54))
256/(0.1 * 320) = 8 * (1 - 2^(-54) + 2^(-108) - ...)
If the calculation is performed at double precision, 1 + 2^(-54) == 1 and already 0.1 * 320 == 32.
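A quick check of the double case, assuming strict double evaluation (FLT_EVAL_METHOD == 0, e.g. SSE2 math) so that no excess precision is involved:

#include <cassert>

int main() {
    // Exact product: 32*(1 + 2^(-54)); rounded to 53 bits, that is exactly 32.
    double product = 0.1 * 320.0;
    assert(product == 32.0);
    assert(static_cast<int>(256 / product) == 8);
}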
If the calculation is performed at extended precision with 64 bits of precision (think x87) or more, it is likely that the literal 0.1 isn't converted to double precision at all and is directly used with the extended precision, which again leads to the multiplication 0.1 * 320 resulting in exactly 32.
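One way to imitate that portably is to spell the extended precision out with long double, i.e. round the literal directly to the wider format (a sketch; on x86, long double is usually the 80-bit x87 format with a 64-bit significand):

#include <cstdio>

int main() {
    // 0.1L is the literal rounded straight to long double precision.
    // With b = 64, the exact product is 32*(1 + 2^(-66)), which rounds
    // back to exactly 32, so the quotient is exactly 8.
    long double product = 0.1L * 320.0L;
    std::printf("0.1L * 320 == 32: %s\n", product == 32.0L ? "yes" : "no");
    std::printf("256 / (0.1L * 320) = %d\n", static_cast<int>(256 / product));
}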
If the literal 0.1 is used at double precision but the calculation is performed at higher precision, it would again yield 7 if the intermediate result is directly truncated to int from the representation with greater precision, and 8 if the excess precision is removed before the conversion to int.
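That mixed situation can also be imitated with long double standing in for the greater precision: round 0.1 to double first, do the arithmetic in long double, and compare converting to int directly against narrowing to double first (hedged sketch; the 7 assumes long double really is wider than double):

#include <cstdio>

int main() {
    // Literal rounded to double first, arithmetic carried out in long double.
    long double wide = static_cast<long double>(0.1) * 320.0L; // 32 + 2^(-49) when long double can hold it
    int direct = static_cast<int>(256 / wide);          // 7 if long double is wider than double

    // Removing the excess precision before the conversion restores 8.
    double narrowed = static_cast<double>(wide);        // rounds to exactly 32
    int via_double = static_cast<int>(256 / narrowed);  // 8

    std::printf("direct: %d, via double: %d\n", direct, via_double);
}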
(Aside: gcc/g++ 4.5.1 yields 8 for all cases, regardless of optimisation level, on my 64-bit box; I haven't tried on a 32-bit box.)
¹ I'm not entirely sure, but I think that's a violation of the standard; it should first remove the excess precision. Any language lawyers?