
Consider the following C++ program:

#include <iostream>
using std::cout;
using std::endl;

int main () {
   float x = 0.0001;
   float y = 0;
   for (int i=0; i < 10000; i++) {
      y += x;
   }
   cout << y << endl;
   return 0;
}

Compile and run this program and then answer the following questions: How does the actual behaviour of this program differ from its expected behaviour?

Why is the expected behaviour not seen?

While ensuring that the program semantics stay the same, what changes would you make to this program to ensure that the expected and the actual behaviour do match?

The above is my assignment. I know I am supposed to do my own homework, but I'm stuck.

  • For part a) I simply said the two numbers are different.

  • For part c) I changed the float into a double. (I think the semantics stay the same.)

  • For part b) I know this is called catastrophic cancellation, but the prof probably wants to see more than that, and I have no idea what else to say. Can someone help me?

Thanks for your help

user44322
    This might be worth a read: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html – Seth Oct 14 '12 at 01:59
  • To begin with, 0.0001 cannot be represented exactly with standard floating point values. Using fixed-point calculations is a good alternative. – Vaughn Cato Oct 14 '12 at 02:00
  • For a floating point number to be represented in a finite number of binary digits, it must be of the form n/2^m. The number 0.0001 isn't. In base 2, it falls into a category similar to expressing 1/3 in a base-10 system; it has a non-terminating expansion in base 2. – DavidO Oct 14 '12 at 02:01
  • How does your professor know what behavior you expect (or did he tell you what HE expected)? I would expect that floating point arithmetic is, in the general case, approximate, so the result would not be exactly 1.0 (and, I suspect, more likely to be less than 1.0 than greater). I wouldn't expect that loss of significance would become a major factor until you piled on a couple more powers of ten, though. – Hot Licks Oct 14 '12 at 02:15
  • It's hard to say what your professor calls "expected behavior": if you knew what to expect, you'd know that the numbers wouldn't add up to 1. One fix would be to switch to powers of two: if you use `x = 0.0001220703125` and add it up 8192 times, you'd get the numbers summing up precisely to 1. – Sergey Kalinichenko Oct 14 '12 at 02:18
  • Mathematically, the expected result should be 1. Instead I got 1.00005 – user44322 Oct 14 '12 at 02:19
  • And 1.00005 is about the amount of error I'd expect. – Hot Licks Oct 14 '12 at 02:20
  • To make fractions add up to what you expect mathematically, the numbers must be representable in floating point exactly. Negative powers of two and their sums are representable with `float`; decimal-base floating point numbers, such as `decimal` of .NET, would represent negative powers of ten exactly. Take a look at your modified program here: [link](http://ideone.com/9H5Qx) - it outputs 1. – Sergey Kalinichenko Oct 14 '12 at 02:24
  • Remember, 32-bit IEEE float carries about 7 decimal digits of precision -- that's the BEST accuracy you can expect. And when you do addition 10,000 times you're going to lose even more accuracy. – Hot Licks Oct 14 '12 at 02:24
  • I would say your professor assumes that your expectation will be that .0001 is representable in a finite number of base-2 digits, and that you will be surprised to discover it isn't. This is an exercise everyone who learns to program should encounter and come to terms with. It's unbelievable how many times I see people get tripped up by this simple reality. Your professor is doing you a favor by forcing you to encounter it now rather than in a job interview. – DavidO Oct 14 '12 at 02:43
  • @DavidO -- Odd, I never really had this problem. Maybe because the prof spent a half-hour or so explaining how floating point is inexact and giving a few examples. (It is easy to forget, however, especially with Objective-C and some other languages that are not fussy about what's int and what's float -- you lose track of which is which and get burned.) – Hot Licks Oct 14 '12 at 14:09
  • [Is floating point math broken?](http://stackoverflow.com/q/588004/995714) – phuclv Feb 23 '17 at 10:36

2 Answers


How does the actual behaviour of this program differ from its expected behaviour? – The actual behaviour of this program is to add the IEEE representation of 0.0001 to a sum 10,000 times, and the IEEE representation of 0.0001 != the actual 0.0001.

Why is the expected behaviour not seen? – We assume 0.0001 is represented exactly as 0.0001; in reality it isn't, since IEEE floating point can't represent 0.0001 exactly, because it has to represent all values in base 2 instead of base 10.

While ensuring that the program semantics stay the same, what changes would you make to this program to ensure that the expected and the actual behaviour do match? – Changing float to double will work in this case, because double gives you more decimal precision than float. An alternative solution is to keep float and, instead of doing the summation, assign y = 10000*x; this incurs less error, and it's better when you're looking to avoid roundoff and approximation errors.

NinjaCowgirl

I suppose the purported “expected behavior” of this program is to add .0001 to a sum initialized to zero 10,000 times, with all arithmetic being mathematically exact, yielding 1. The actual behavior is to convert the decimal numeral “.0001” to a double (likely an IEEE-754 64-bit binary floating-point), then to convert that value to a float (likely an IEEE-754 32-bit binary floating-point), then to add that float to a sum 10,000 times, using float arithmetic each time. Thus, the actual behavior has potential rounding errors when converting the numeral to a double, when converting the double to a float, and in each addition.

One way to avoid error in this situation is to use integer arithmetic. Instead of setting float x to .0001, we could set int x to 1. Similarly, y would be an int, and we would use all integer arithmetic until we were done with the loop. After obtaining the final sum, then we would convert it to float. Then we have to adjust for the scaling we used that allowed us to use integer arithmetic. Since we were adding 1 instead of .0001, we have to divide the final result by 10000.f to adjust it. (This technique will not completely avoid error in all situations, and there are other techniques for reducing error in other situations.)

There is no catastrophic cancellation, because there is no cancellation. Cancellation occurs when two numbers are added or subtracted to produce a smaller result (thus, when adding two numbers of opposite sign, such as adding +8 and -6 to get +2, or subtracting two numbers of the same sign, such as subtracting -6 from +8 to get +2). Catastrophic cancellation occurs when the result is much smaller than the original two numbers. In this situation, the value we are working with gets much smaller, but any error that was in the original numbers typically stays the same size, so the error is much larger relative to the value we are working with. E.g., suppose we are supposed to subtract 8 from 8.01 and get .01, but, due to a small error, we have 7.99 in place of 8. Subtracting 7.99 from 8.01 yields .02. This result, .02, is twice the desired result, .01, so the relative error is huge.

Eric Postpischil