
I just want to understand the cases below:

#define MAP_CELL_SIZE_MIN 0.1f

float mMapHeight = 256;
float mScrHeight = 320;

int mNumRowMax;

case 1:

mNumRowMax = mMapHeight/( MAP_CELL_SIZE_MIN * mScrHeight );

mNumRowMax is now 7, but it should actually be 8 (256/32). If I change the define of MAP_CELL_SIZE_MIN to plain 0.1, it works and mNumRowMax is 8. So what is wrong with the 'f'?

case 2:

float tmp = mMapHeight/( MAP_CELL_SIZE_MIN * mScrHeight );//tmp = 8.0
mNumRowMax = tmp;

mNumRowMax is now 8. Can anybody help me understand what goes wrong in the first case, where mNumRowMax is 7?
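For reference, here are the two cases wrapped into a minimal compilable program (a sketch of the snippet above; the surrounding real code is omitted):

#include <iostream>

#define MAP_CELL_SIZE_MIN 0.1f

int main()
{
    float mMapHeight = 256;
    float mScrHeight = 320;

    // case 1: convert the expression straight to int
    int case1 = mMapHeight / (MAP_CELL_SIZE_MIN * mScrHeight);

    // case 2: store the quotient in a float first, then convert to int
    float tmp = mMapHeight / (MAP_CELL_SIZE_MIN * mScrHeight);
    int case2 = tmp;

    // case1 may be 7 or 8 depending on how the compiler handles excess
    // precision; case2 is 8
    std::cout << case1 << " " << case2 << "\n";
}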

nvhausid
  • Seems like you could refactor this code to use integers and avoid floating point entirely. Your math would be different (e.g. divide by 10 instead of multiply by 0.1) but guaranteed to be precise. Is there a reason to use floating point, given that your end result is an integer? –  Apr 18 '12 at 02:42
  • This snippet is just to show the problem I ran into; my real code is different and I need this formulation. I now use 0.1 instead of 0.1f, I just want to understand the problem. – nvhausid Apr 18 '12 at 03:23
  • Possible duplicate of [C++ float to int](http://stackoverflow.com/questions/3127962/c-float-to-int) – STF Jan 05 '16 at 09:17

3 Answers


What happens here is described by this passage of the standard, 5 [expr], paragraph 10:

"The values of floating operands and of the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby." (footnote 55)

Footnote 55: "The cast and assignment operators must still perform their specific conversions as described in 5.4, 5.2.9 and 5.17."

(C++03; practically identical 6.3.1.8(2) in C99 and the n1570 draft of C11; I'm confident that the gist is identical in C++11.)

In the following, I assume an IEEE-754 like binary floating point representation.

In a fractional hexadecimal notation,

1/10 = 1/2 * 3/15
     = 1/2 * 0.33333333333...
     = 2^(-4) * 1.999999999...

so when that is rounded to b bits of precision, you get

2^(-4) * 1.99...9a   // if b ≡ 0 (mod 4) or b ≡ 1 (mod 4)
2^(-4) * 1.99...98   // if b ≡ 2 (mod 4) or b ≡ 3 (mod 4)

where the last hex-digit in the fractional part is truncated after the 3,4,1,2 most significant bits respectively.

Now 320 = 2^6*(2^2 + 1), so the result of r * 320 where r is 0.1 rounded to b bits, is, in full precision (ignoring the power of 2),

   6.66...68
 + 1.99...9a
 -----------
   8.00...02

with b+3 bits for b ≡ 0 (mod 4) or b ≡ 1 (mod 4) and

   6.66...60
 + 1.99...98
 -----------
   7.ff...f8

with b+2 bits for b ≡ 2 (mod 4) or b ≡ 3 (mod 4).

In each case, rounding the result to b bits of precision yields exactly 32 and then you get 256/32 = 8 as a final result. But if the intermediate result with greater precision is used, the calculated result of

256/(0.1 * 320)

is slightly smaller or larger than 8.

With the typical 32-bit float with 24 (23+1) bits of precision, if the intermediate results are represented with a precision of at least 53 bits:

0.1f = 1.99999ap-4
0.1f * 320 = 32*(1 + 2^(-26))
256/(0.1f * 320) = 8/(1 + 2^(-26)) = 8 * (1 - 2^(-26) + 2^(-52) - ...)
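For reference, these values can be inspected directly with the %a (hexadecimal floating point) printf format; a small sketch, assuming IEEE-754 binary32/binary64:

#include <cstdio>

int main()
{
    float r = 0.1f;
    float p = r * 320.0f;   // stored in a float, so rounded back to 24-bit precision

    std::printf("0.1f       = %a\n", r);               // typically 0x1.99999ap-4
    std::printf("0.1f * 320 = %a\n", p);               // typically 0x1p+5, i.e. exactly 32
    std::printf("0.1  * 320 = %.17g\n", 0.1 * 320.0);  // typically exactly 32 at double precision
}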

In case 1, the intermediate result is converted¹ to int directly. Since that intermediate result is slightly smaller than 8, it gets truncated to 7.

In case 2, the intermediate result is stored in a float before converting to int, hence it is rounded to 24 bits of precision first, resulting in exactly 8.

Now if you leave off the f suffix, 0.1 is a double (presumably with 53 bits of precision), the two floats are promoted to double for the calculation, and

0.1 = 1.999999999999ap-4
0.1 * 320 = 32*(1 + 2^(-55))
256/(0.1 * 320) = 8 * (1 - 2^(-55) + 2^(-110) - ...)

If the calculation is performed at double precision, 1 + 2^(-55) == 1 and already 0.1 * 320 == 32.

If the calculation is performed at extended precision with 64 bits of precision (think x87) or more, it is likely that the literal 0.1 isn't converted to double precision at all and directly used with the extended precision, which again leads to the multiplication 0.1 * 320 resulting in exactly 32.

If the literal 0.1 is used at double precision but the calculation is performed at higher precision, it would again yield 7 if the intermediate result is directly truncated to int from the representation with greater precision and 8 if the excess precision is removed before the conversion to int.
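One way to see which of these regimes a given compiler uses is FLT_EVAL_METHOD; a small sketch (the macro comes from C99/C++11's <cfloat>, so older toolchains may not define it):

#include <cfloat>
#include <cstdio>

int main()
{
    // 0: evaluate in the range and precision of the operand types
    // 1: evaluate float and double operations in double precision
    // 2: evaluate everything in long double precision (classic x87 behaviour)
    // negative values: implementation-defined
    std::printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
}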

(Aside: gcc/g++ 4.5.1 yields 8 in all cases, regardless of optimisation level, on my 64-bit box; I haven't tried a 32-bit box.)

¹ I'm not entirely sure, but I think that's a violation of the standard; it should first remove the excess precision. Any language lawyers?

Daniel Fischer
  • I am confused about what happens when I **cast** from `float` to `int`. Because float and int are "encoded" differently, won't converting the bits directly cause the values to be different? How do languages like C handle float casts? EDIT: People on this website suggested I use `static_cast<int>(floatVar)`. Does this perform the necessary steps for a safe cast? – Mathew Kurian Dec 19 '13 at 18:06
  • Typically, there's a machine instruction for that conversion, and the compiler will just use that. If there is no such machine instruction, the implementation (if there is one for such an inconvenient platform) will provide its own implementation of the conversion according (hopefully) to the standard. If the representation of floating point numbers is as described by IEEE-754, it's just testing sign and exponent, and a bit of masking and shifting, since the representations are closely related. – Daniel Fischer Dec 19 '13 at 18:13
  • That is neat. Thank you. Just to clarify, what is the difference between `static_cast` and `(int)` in this case? – Mathew Kurian Dec 19 '13 at 18:15
  • Regarding the edit, @mk1, with any sane compiler, a `static_cast<int>(x)`, an `(int)x`, and an implicit conversion (if the compiler allows it) all result in the same code produced, and the same runtime result. Using `static_cast` has the advantage of being clean and clear about your intentions. – Daniel Fischer Dec 19 '13 at 18:15
  • Thank you! Last question for you :-) Since `static_cast` is meant for implicit casts, why is it better to use `static_cast` when you are actually converting types (as opposed to a scenario where you are trying to cast from `void*` to `MClass*`)? – Mathew Kurian Dec 19 '13 at 18:18
  • `static_cast` is meant for explicit standard conversions, like converting between integer and floating point types, or different integer types. Every cast is an explicit conversion. Some of these conversions are no-ops (converting from a signed integer type to the corresponding unsigned type usually is a no-op), others aren't, like conversions between integer and floating point types. – Daniel Fischer Dec 19 '13 at 18:25

When a floating point number is cast to an integer, the value is truncated and not rounded, i.e. all decimals are just "chopped off".
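A small sketch of that truncation (the value is just an example):

#include <iostream>

int main()
{
    float f = 7.999f;
    int   i = static_cast<int>(f);  // conversion to int truncates toward zero
    std::cout << i << "\n";         // prints 7, not 8
}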

Some programmer dude

It appears you are running into rounding errors.

A simple fix might be to use double instead of float.

If that's not an option, then you might need to round to the nearest integer. For example, if you have a floating-point value f, do the equivalent of int x = (int)(f + 0.5);
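For example, applied to the values from the question (a sketch; std::lround needs C++11, otherwise the (int)(f + 0.5) form above does the same for positive values):

#include <cmath>
#include <iostream>

#define MAP_CELL_SIZE_MIN 0.1f

int main()
{
    float mMapHeight = 256;
    float mScrHeight = 320;

    float quotient = mMapHeight / (MAP_CELL_SIZE_MIN * mScrHeight);

    // round to the nearest integer instead of truncating
    int mNumRowMax = static_cast<int>(std::lround(quotient));

    std::cout << mNumRowMax << "\n";  // prints 8 even if quotient is 7.9999...
}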

Jonathan Wood