Error in simple float calculations

Question

I am having trouble with:

```csharp
int q = 150;
float s = 0.7f;
float z = q * s;
int z1 = (int)(q * s);
int z2 = (int)z;
```

This results in

z1 being int with value 104
z2 being int with value 105

Can anyone explain this? I do not understand these results.

To avoid closing, I (René Vogt) add this information:

q*s results in a float of value 105.0f (or maybe 104.999999, but a string representation ends up as 105),
so z is a float of 105.

The question now is: why does (int)z result in 105 while (int)(q*s) results in 104? I can reproduce this on my machine (i7, Win10, VS2015, .NET 4.6.1).

And the IL code:

```
// Initialisation
// q = 150
ldc.i4 0x96
stloc.0
// s = 0.7f
ldc.r4 0.69999999
stloc.1

// calculating z
ldloc.0 // load q
conv.r4 // convert to float
ldloc.1 // load s
mul     // q*s
stloc.2 // store float result to z

// calculating z1
ldloc.0 // load q
conv.r4 // convert to float
ldloc.1 // load s
mul     // q*s
conv.i4 // convert to int
stloc.3 // store int result in z1 => 104!

// calculating z2
ldloc.2 // loading float z
conv.i4 // converting to int
stloc.s // write to z2 (last local variable -> "s" as stack address)
        // => 105
```

So I think the only difference between z1 and z2 is that for z2 the intermediate float result gets written from the register to the local variable's (z) storage place. But how would that affect the result?

What are the types of `z`/`z1`/`z2`? Please learn how to provide an example that actually compiles. — DavidG, Feb 06 '17 at 11:34
@AndrewPaes Actually that would make even more compile time errors. — DavidG, Feb 06 '17 at 11:37
no, `float`*`int` is `float`. My normal answer to those questions is something like "floating point arithmetic", "rounding issue" or "integer blabla", but this really puzzles me — René Vogt, Feb 06 '17 at 11:38
I would guess that `z` is float, `z1 \ z2` are int since that's what they're cast to. — Equalsk, Feb 06 '17 at 11:39
@DavidG you are right that OP should provide that, but you can easily reproduce this, declaring all as `var`, resulting in `float z`, `int z1` and `int z2`, and it still results in `z1 = 104` and `z2 = 105`. — René Vogt, Feb 06 '17 at 11:40
I'm trying to achieve this but with no success - https://dotnetfiddle.net/rohBUQ — Andrew Paes, Feb 06 '17 at 11:42
@RenéVogt [Here](https://dotnetfiddle.net/KYpSRV) it always results in 105. — 3615, Feb 06 '17 at 11:42
I'd be interested to know what's going on under the hood as `var z1 = (int)(q*s);` = 104 while `var z1 = Convert.ToInt32(q*s);` = 105. — Equalsk, Feb 06 '17 at 11:49
The only difference in IL is that for `z2` the result of `q*s` was written to a local variable's storage place and then reloaded and converted, while for `z1` the result of `q*s` is converted directly from the register without storing in a variable's place. Maybe that "storing" does some strange conversion/rounding... really hoping for a professional explanation. This "rounding" mistake seems somewhat scary... — René Vogt, Feb 06 '17 at 12:18
@DavidG His code compiles fine. `var` was added to C# in C# 3.0. This happened at least a couple of years ago. Anyway, `var` tells the compiler to infer the type from the return type of the expression. So z would be float. z1 and z2 would be int because of the explicit typecast. — Skye MacMaster, Feb 06 '17 at 15:34
@ScottMacMaster I'm well aware of `var`, my comment referred to the [initial version of this question](http://stackoverflow.com/revisions/42066772/1) which did not compile. — DavidG, Feb 06 '17 at 15:36

Equalsk · Accepted Answer · 2017-02-06T15:21:08.963

6

The number 0.7 cannot be represented exactly by a float; the value actually stored in s is 0.699999988079071044921875.
The int value of q is converted to a float. Since 150 can be represented exactly, it stays as 150.

If you multiply the two together you won't get 105 exactly:

q = 150
s = 0.699999988079071044921875
q * s = 104.999998211861
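
The same arithmetic can be checked exactly with integers. The following is a minimal sketch and not part of the original answer; the class name `ExactProduct` and the variable names are illustrative. It decodes the bits of `0.7f` into an integer significand and a power-of-two exponent, then compares `150 * significand` with 105 scaled by the same power of two.

```csharp
using System;

class ExactProduct
{
    static void Main()
    {
        float s = 0.7f;

        // Reinterpret the 32 bits of the float as an int (IEEE 754 single precision).
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(s), 0);

        // Normal numbers have an implicit leading 1 in front of the 23 stored fraction bits.
        long significand = (bits & 0x7FFFFF) | 0x800000;
        // Remove the exponent bias (127) and account for the 23 fraction bits.
        int exponent = ((bits >> 23) & 0xFF) - 127 - 23;

        // s == significand * 2^exponent
        Console.WriteLine(significand + " * 2^" + exponent); // 11744051 * 2^-24

        long product = 150L * significand; // exact value of q * s, scaled by 2^24
        long target = 105L << 24;          // 105, scaled by 2^24

        Console.WriteLine(target - product); // 30, so q * s = 105 - 30/2^24
    }
}
```

Neighbouring floats just below 105 are 128/2^24 apart, so a value only 30/2^24 below 105 rounds to 105 when it is narrowed to float, but truncates to 104 when it is converted straight to int.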

Now refer to the relevant part in the CLI Spec (ECMA-335) §12.1.3:

When a floating-point value whose internal representation has greater range and/or precision than its nominal type is put in a storage location, it is automatically coerced to the type of the storage location. This can involve a loss of precision or the creation of an out-of-range value (NaN, +infinity, or -infinity). However, the value might be retained in the internal representation for future use, if it is reloaded from the storage location without having been modified. It is the responsibility of the compiler to ensure that the retained value is still valid at the time of a subsequent load, taking into account the effects of aliasing and other execution threads (see memory model (§12.6)). This freedom to carry extra precision is not permitted, however, following the execution of an explicit conversion (conv.r4 or conv.r8), at which time the internal representation must be exactly representable in the associated type.

So q * s results in a value with higher precision than float can handle. When converting this directly to an int:

```csharp
var z1 = (int)(q * s);
```

The value is never coerced to the type float, but directly cast to int and thereby truncated to 104.

In all other examples the value was cast to or stored in a float and therefore converted to the nearest possible float value, which is 105.
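
The two paths can be made explicit in code. This is a sketch under a stated assumption: `double` stands in here for whatever wider internal representation the runtime happens to use, which depends on the JIT and platform. What the original `(int)(q * s)` yields on a given machine is therefore not something this sketch predicts; it only shows why truncating the unrounded product and truncating the float-rounded product give different results. The class name `ExtraPrecision` is illustrative.

```csharp
using System;

class ExtraPrecision
{
    static void Main()
    {
        int q = 150;
        float s = 0.7f;

        // Assumption: double plays the role of the wider internal representation.
        // The product of 150 and 0.7f fits in a double without rounding.
        double wide = (double)q * s;

        int truncated = (int)wide;       // never coerced to float
        int viaFloat = (int)(float)wide; // explicit conversion to float first

        Console.WriteLine(truncated); // 104
        Console.WriteLine(viaFloat);  // 105
    }
}
```

Writing `(int)(float)(q * s)` in the original code has the same effect as the second path: the explicit conversion forces the value to be exactly representable as a float before it is truncated.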

edited Feb 06 '17 at 15:21

answered Feb 06 '17 at 14:09

Equalsk


Sounds good, but I'm still not convinced...the result of `q*s` _is_ a `float`. Why does it get "rounded" to 105 when stored in a `float` variable but not when kept in a register (what is only my understanding of what happens "under the hood", but I already may be wrong using the term "register" here). – René Vogt Feb 06 '17 at 14:18
This doesn't explain the difference between the two options. Also, are you sure that 150 can accurately be stored as a float? – DavidG Feb 06 '17 at 14:18
@RenéVogt Well this is the "under the hood" part that I'm not sure about and hoping for some guidance on. I think part of it is that the maths for computing `z` is done at compile time and `105` is 'hard coded' while `q * s` is run-time and therefore `104.999..`. Again, I could be wrong, hoping that someone like Jon Skeet sees this and has a better answer as I'm out of my depth a little. – Equalsk Feb 06 '17 at 14:22
@DavidG I'm still learning about floating point arithmetic so please prove me wrong as it would be of benefit but I'm sure that 150 and 105 can both be represented accurately by float and only 0.7 cannot. – Equalsk Feb 06 '17 at 14:23
@Equalsk as you can see from the IL I posted, `z` is computed at run-time, not at compile-time. – René Vogt Feb 06 '17 at 14:23
@RenéVogt Thanks René, I missed that! ;-) – Equalsk Feb 06 '17 at 14:24
I tried and checked IL again with @DavidG's `(int)(float)(q*s)`...I think the "rounding" from `104.999...` to `105.0` is done by the `conv.r4` command, which "Convert[s] to float32, pushing F on stack" (wikipedia). This IL command is not issued when converting the result of `mul` directly to `int` via `conv.i4`, but always when converting (explicitly or implicitly) to `float`. – René Vogt Feb 06 '17 at 14:56
I'm not very good at reading IL, it's new-ish to me, but that sounds right. 104.999 can't be represented by a float, its closest representation is actually 105 which is why z is 'rounded'. 104.999 cast directly to an int is obviously 104 when precision is lost. I've edited my post to hopefully be slightly clearer, although still missing the technicalities. – Equalsk Feb 06 '17 at 15:00
@RenéVogt Have tried to remove the fluff and make it clearer, hope that's OK with you. – Equalsk Feb 06 '17 at 15:21
@Equalsk that's okay, but now I feel a little uncomfortable with having posted the same answer again :) I guess I delete it again and we keep the community wiki for the sake of knowledge instead of rep. – René Vogt Feb 06 '17 at 15:31
I like that this is a CW, there was some good collaboration here. – DavidG Feb 06 '17 at 15:37
(PS I edited the question to remove `var` and replace with `float` for clarity) – DavidG Feb 06 '17 at 15:44
Of course, now I've figured out it's a dupe anyway :) – DavidG Feb 06 '17 at 15:49
Eh, I like our answer better, not that I'm biased... – Equalsk Feb 06 '17 at 15:51
I agree, the inclusion of the spec does help. – DavidG Feb 06 '17 at 15:52

score 0 · Answer 2 · edited May 23 '17 at 12:31

I am guessing you are running this on x64?

Therefore I am assuming that the multiplication happens as double. But then the double is cast back down to float for storage (and therefore being e.g. 105.000000001)

```
mul     // q*s
stloc.2 // store float result to z (stored as float)
```

When multiplying directly, and converting to int, the multiplication happens as double which could be represented as e.g. 104.999999, and truncated to int (104)

```
mul     // q*s
conv.i4 // convert to int
```

Check out this answer: Strange behaviour when casting a float to int in C#

Edit: In the first instance, the compiler has no option but to use float arithmetic, as it needs to store the result back in a float. But in the second case, where there is a direct cast to int, it is free to use higher-precision arithmetic, as per the linked answer.

score -1 · Answer 3 · answered Feb 06 '17 at 15:22

When converting between binary and decimal you can't always represent the same numbers exactly (remember that binary has only 2 digits where decimal has 10).

A decimal equivalent is trying to write 1/3 on paper: 0.3333333333333... (infinitely!)

As floats cannot represent recurring numbers, this leads to a loss of precision: you end up storing just 0.333333 instead of 0.333333333333.... (simplified example). This is fine for low-precision applications and benefits performance (when you are considering millions of calculations).

Instead you should use the Decimal or Double types, as these are able to store numbers using a scientific-notation representation which doesn't lose precision as easily.
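
As a hedged illustration of the `decimal` suggestion (a sketch, not the answerer's code; the class name `DecimalProduct` is illustrative): `decimal` is a base-10 type, so `0.7m` is stored exactly and the product is exactly 105. Note that `double` is still a binary floating-point type and cannot store 0.7 exactly; it only carries more bits than `float`.

```csharp
using System;

class DecimalProduct
{
    static void Main()
    {
        int q = 150;
        decimal s = 0.7m; // stored exactly as 7 * 10^-1

        decimal product = q * s; // int is implicitly converted to decimal

        Console.WriteLine((int)product); // 105
    }
}
```

This sidesteps the representation error of 0.7, at the cost of slower arithmetic than `float` or `double`.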

The best explanation I have ever come across for this is this: https://www.youtube.com/watch?v=PZRI1IfStY0

This question is not directly about floating point precision, it's about the difference between the 2 methods of calculating the result. — DavidG, Feb 06 '17 at 15:46
