Why does this expression cause a floating point error?

Question

Floating-point operations are inexact, but that doesn't fully explain what's going on here:

[46] pry(main)> a=0.05
=> 0.05
[47] pry(main)> a=a*26.0/65
=> 0.02

So here we get what we expect: the right answer, and the world keeps turning beautifully. But later we rewrite this function, and while doing so we swap the line a=a*26.0/65 for a*=26.0/65. Isn't that nice, we typed one less character! Let's see how that worked out for us:

[48] pry(main)> a=0.05
=> 0.05
[49] pry(main)> a*=26.0/65
=> 0.020000000000000004
[50] pry(main)> 26.0/65
=> 0.4

This suggests that a*=b is not the same as writing a=a*b. It doesn't seem to be a normal float rounding error, because none of these numbers should need rounding as a float (the mantissa should be more than long enough for each of 26.0, 26.0/65, and 65.0).

I'm sure there's something subtle going on under the hood, and I would like to know what it is.

Reproduced with ruby 2.0.0p247 and 1.9.3p392 [x86_64-linux]. I like the part where "the world keeps turning beautifully" :) — tessi, Nov 20 '13 at 16:23

score 5 · Accepted Answer · answered Nov 20 '13 at 17:17

It is not true that the significand of the floating-point format has enough bits to represent 26/65. (“Significand” is the preferred term. Significands are linear. Mantissas are logarithmic.)

The significand of a binary floating-point number is a binary integer. This integer is scaled according to the exponent. To represent 26/65, which is .4, in binary floating-point, we must represent it as an integer multiplied by a power of two. For example, an approximation to .4 is 1•2^-1 = .5. A better approximation is 3•2^-3 = .375. Better still is 13•2^-5 = .40625.

However, no matter what integer you use for the significand or what exponent you use, this format can never be exactly .4. Suppose you had .4 = f•2^e, where f and e are integers. Then 2/5 = f•2^e, so 2/(5f) = 2^e, and then 1/(5f) = 2^(e-1) and 5f = 2^(1-e). For that to be true, 5 would have to be a factor of a power of two. It is not, so you cannot have .4 = f•2^e.

In IEEE-754 64-bit binary floating-point, the significand has 53 bits. With this, the closest representable value to .4 is 0.40000000000000002220446049250313080847263336181640625, which equals 3602879701896397•2^-53.
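These values can be checked from Ruby itself. What follows is only a sketch, assuming Ruby 2.1 or later (for Integer#bit_length). Float#to_r returns the exact rational value of the stored double, and its denominator is always a power of two, so the full decimal expansion can be produced with integer arithmetic. The helper name exact_decimal is made up for this illustration and is not part of Ruby.

```ruby
# Hypothetical helper (not part of Ruby): exact decimal expansion of a finite Float.
def exact_decimal(x)
  r = x.to_r                          # exact stored value; denominator is 2**k
  k = r.denominator.bit_length - 1
  # n / 2**k == (n * 5**k) / 10**k, so the digits come from an integer product.
  digits = (r.numerator.abs * 5**k).to_s.rjust(k + 1, "0")
  sign = r < 0 ? "-" : ""
  return sign + digits if k.zero?
  sign + digits[0...-k] + "." + digits[-k..-1]
end

puts 0.4.to_r == Rational(2, 5)                      # false: .4 is not stored exactly
puts 0.4.to_r == Rational(3602879701896397, 2**53)   # true
puts exact_decimal(0.4)
puts exact_decimal(0.05)
```

The last two lines print every digit of the stored values, which can be compared with the long decimals quoted in this answer.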

Now let us look at your calculations. In a=0.05, 0.05 is converted to floating-point, which produces 0.05000000000000000277555756156289135105907917022705078125.

In a*26.0/65, a*26.0 is evaluated first. The exact mathematical result is rounded to the nearest representable value, producing 1.3000000000000000444089209850062616169452667236328125. Then this is divided by 65. Again, the answer is rounded, producing 0.0200000000000000004163336342344337026588618755340576171875. When Ruby prints this value, it apparently decides it is close enough to .02 that it can just display “.02” and not the complete value. This is reasonable in the sense that, if you convert the printed value .02 back to floating-point, you get the actual value again, 0.0200000000000000004163336342344337026588618755340576171875. So “.02” is in some sense a good representative for 0.0200000000000000004163336342344337026588618755340576171875.
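The round-trip behaviour described here can be checked directly. A small sketch, assuming a Ruby (1.9 or later) whose Float#to_s prints the shortest decimal string that converts back to the same double:

```ruby
x = 0.05 * 26.0 / 65              # evaluated left to right: (0.05 * 26.0) / 65

puts x.to_s                       # 0.02
puts "0.02".to_f == x             # true: the short string converts back to the same double
puts x.to_r == Rational(2, 100)   # false: the stored value is still not exactly .02
```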

In your alternative expression, you have a*=26.0/65. In this, 26.0/65 is evaluated first. This produces 0.40000000000000002220446049250313080847263336181640625. This is different from the first expression because you have performed the operations in a different order, so a different number was rounded. It may have happened that a value in the first expression was rounded down whereas this different value, because of where it happened to land relative to values representable in floating-point, rounded up.

Then the value is multiplied by a. This produces 0.02000000000000000388578058618804789148271083831787109375. Note that this value is further from .02 than the result of the first expression. Your implementation of Ruby knows this, so it determines that printing “.02” is not enough to represent it accurately. Instead, it displays more digits, showing 0.020000000000000004.
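Both evaluation orders can be compared step by step. This is a sketch, assuming Ruby 2.2 or later for Float#next_float; the variable names are only for illustration. Float#to_r exposes the exact stored value, which makes the rounding at each step visible.

```ruby
a = 0.05

product     = a * 26.0          # first step of a*26.0/65
left_first  = product / 65
ratio       = 26.0 / 65         # first step of a*=26.0/65
ratio_first = a * ratio

# Each intermediate result was rounded: it differs from the exact rational result.
puts product.to_r == a.to_r * 26          # false
puts ratio.to_r == Rational(26, 65)       # false

# The first order lands on the double nearest .02, the second on the next double up.
puts left_first == 0.02                   # true
puts ratio_first == 0.02.next_float       # true

# Exact distance of each result from 1/50, as a rational number.
puts left_first.to_r - Rational(1, 50)
puts ratio_first.to_r - Rational(1, 50)
```

The second distance is larger than the first, which is why the shorter string "0.02" no longer identifies the second result.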

+1 Respect! Great explanation. I had the clue but you made this clear as water. — Paulo Bu, Nov 20 '13 at 17:22
Could I ask where you got the precise values of all of these results from? did you use a calculation or some tool that performs the calculation? and if you used ruby, how did you make it display all the digits? Thanks. — Mike H-R, May 02 '14 at 10:33
@MikeH-R: Apple’s standard C library, and the rest of its C implementation, produces correctly rounded results. So to get the exact values of floating-point numbers, all you need to do is request plenty of digits, as with `printf("%.999g\n", x);`. (Not all C implementations do this; some may produce results correct only to around 17 digits.) — Eric Postpischil, Apr 14 '23 at 11:08

score 3 · Answer 2 · edited May 23 '17 at 10:25

I think I get what is going on here. Take a look at this code and the order of operations:

irb(main):001:0> a=0.05
=> 0.05
irb(main):002:0> b=26.0
=> 26.0
irb(main):003:0> c=65
=> 65
irb(main):004:0> a*b/c
=> 0.02
irb(main):005:0> a*(b/c)
=> 0.020000000000000004

Here, a*b/c is how the interpreter should evaluate your expression a=a*26.0/65. It evaluates the right side and later assigns the result to the left side of the assignment.

Now, what exactly does the operator *= do? If we force a change in the order of operations with a*(b/c), the code above shows the same result you get from a*=b/c. So under the hood I think Ruby's *= evaluates the right side of the expression first, then multiplies the left side by it, and after that assigns the result to the left side.
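To make the grouping explicit, here is a small sketch (the variable names x, y and z are just for illustration). It relies on Ruby treating a *= expr as a = a * (expr), with the whole right-hand side evaluated first:

```ruby
x = 0.05
x *= 26.0 / 65            # compound assignment

y = 0.05
y = y * (26.0 / 65)       # explicit parentheses around the right-hand side

z = 0.05
z = z * 26.0 / 65         # no parentheses: evaluated as (z * 26.0) / 65

puts x == y               # true
puts x == z               # false
puts x, z
```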

In my view, this is what's happening: Ruby's interpreter changes the order in which the evaluation is performed and, because we're dealing with imprecise floating-point numbers, that can have a visible effect on the result, as Jon Skeet explains in his amazing answer to this question: Why does changing the sum order returns a different result?

Hope this helps!

ok, pretty sure you're right about this as `[63] pry(main)> 0.05*0.4 => 0.020000000000000004` but I'm still confused, from my knowledge of floating point arithmetic this shouldn't happen, both of the representations should have enough bits to represent the number? I remember that the summation order matters but don't see how that applies here? (thanks again for the great answer) — Mike H-R, Nov 20 '13 at 16:33
I can see your point. Let's wait if someone with more background in floating arithmetic than me can unriddle this. I just assume it's ok because with floating point arithmetic I'm always prepared for the worst :) — Paulo Bu, Nov 20 '13 at 16:37
