How are results rounded in floating-point arithmetic?

Question

I wrote this code that simply sums a list of n numbers, to practice with floating point arithmetic, and I don't understand this:

I am working with float, this means I have 7 digits of precision, therefore, if I do the operation 10002*10002=100040004, the result in data type float will be 100040000.000000, since I lost any digit beyond the 7th (the program still knows the exponent, as seen here).

If the input in this program is

You will see that, however, when this program computes 30003*30003=900180009 we have 30003*30003=900180032.000000

I understand this 32 appears becasue I am working with float, and my goal is not to make the program more precise but understand why this is happening. Why is it 900180032.000000 and not 900180000.000000? Why does this decimal noise (32) appear in 30003*30003 and not in 10002*10002 even when the magnitude of the numbers are the same? Thank you for your time.

#include <stdio.h>
#include <math.h>
#define MAX_SIZE 200


int main() 
{
int numbers[MAX_SIZE]; 
int i, N;
float sum=0;
float sumb=0;
float sumc=0;

printf("introduce n" );
scanf("%d", &N);

printf("write %d numbers:\n", N);
for(i=0; i<N; i++)
{
    scanf("%d", &numbers[i]);
}

int r=0;

while (r<N){
    sum=sum+numbers[r];
    sumb=sumb+(numbers[r]*numbers[r]); 
    printf("sum is %f\n",sum);
    printf("sumb is %f\n",sumb);
    r++;
}
sumc=(sum*sum);
printf("sumc is %f\n",sumc);
}

Sometimes, that's as close as you can get with the representation used and the bits available. It's floating-point. It happens.... — Martin James, Mar 09 '18 at 10:48
floating point accumulation error when summing. you can reduce the effect by using `double` — Jean-François Fabre, Mar 09 '18 at 10:49
I would also refer you to [this answer I wrote](https://stackoverflow.com/a/29846080/56541) a while ago. — David Z, Mar 09 '18 at 10:56
The magnitude of your numbers is not the same. They differ by factor 9 which means 3-4 extra bits for binary representation. Therefore the error introduced by rounding to the next higher or lower value that can be stored, is larger by a factor of 8 or 16. While rounding your first number causes an error of -4, it could be +/-64 with the second number. — Gerhardh, Mar 09 '18 at 11:37
Limiting to 7 decimal digits does not mean that the following digits are set to 0, but they are simply not precise. — Gerhardh, Mar 09 '18 at 11:37
No, 100040000.000000 has 15 significant digits. Float has 24 bits in the mantissa (23 stored + 1 implicit) and can therefore represent only 2^24 = 16777216 distinct values. Or log10(16777216) = 7.2 significant digits. You lose 0.5 bit due to rounding on each conversion and calculation. So when you convert from decimal to binary and again back to decimal to display the value then you are left with 23 bits of precision at most. log10(2^23) = 6 significant digits. Brutal math, float would have worked much better with one more bit in the mantissa. — Hans Passant, Mar 09 '18 at 11:54
The bottom line is that you will usually get that "noise" except, on rare occasions, when the "noise" just happens to be 000. Here's another example. `float f = 12345e20` comes out as 1234500068545006330183680, which looks totally bizarre at first, because "12345" looks like it requires just 5 significant digits, so the result ought to be exact, right? But internally it's in binary, not decimal, so those multiple-of-10 0's aren't natural. And, look again, 1234500068545006330183680 *is* correct -- to 8 digits, more or less as expected for a `float`. — Steve Summit, Mar 09 '18 at 12:14

Eric Postpischil · Accepted Answer · 2018-03-09T13:27:08.497

As explained below, the computed result of multiplying 10,002 by 10,002 must be a multiple of eight, and the computed result of multiplying 30,003 by 30,003 must be a multiple of 64, due to the magnitudes of the numbers and the number of bits available for representing them. Although your question asks about “decimal noise,” there are no decimal digits involved here. The results are entirely due to rounding to multiples of powers of two. (Your C implementation appears to use the common IEEE 754 format for binary floating-point.)

When you multiply 10,002 by 10,002, the computed result must be a multiple of eight. I will explain why below. The mathematical result is 100,040,004. The nearest multiples of eight are 100,040,000 and 100,040,008. They are equally far from the exact result, and the rule used to break ties chooses the even multiple (100,040,000 is eight times 12,505,000, an even number, while 100,040,008 is eight times 12,505,001, an odd number).

Many C implementations use IEEE 754 32-bit basic binary floating-point for float. In this format, a number is represented as an integer M multiplied by a power of two 2^e. The integer M must be less than 2²⁴ in magnitude. The exponent e may be from −149 to 104. These limits come from the numbers of bits used to represent the integer and the exponent.

So all float values in this format have the value M • 2^e for some M and some e. There are no decimal digits in the format, just an integer multiplied by a power of two.

Consider the number 100,040,004. The biggest M we can use is 16,777,215 (2²⁴−1). That is not big enough that we can write 100,040,004 as M • 2⁰. So we must increase the exponent. Even with 2², the biggest we can get is 16,777,215 • 2² = 67,108,860. So we must use 2³. And that is why the computed result must be a multiple of eight, in this case.

So, to produce a result for 10,002•10,002 in float, the computer uses 12,505,000 • 2³, which is 100,040,000.

In 30,003•30,003, the result must be a multiple of 64. The exact result is 900,180,009. 2⁵ is not enough because 16,777,215•2⁵ is 536,870,880. So we need 2⁶, which is 64. The two nearest multiples of 64 are 900,179,968 and 900,180,032. In this case, the latter is closer (23 away versus 41 away), so it is chosen.

(While I have described the format as an integer times a power of two, it can also be described as a binary numeral with one binary digit before the radix point and 23 binary digits after it, with the exponent range adjusted to compensate. These are mathematically equivalent. The IEEE 754 standard uses the latter description. Textbooks may use the former description because it makes analyzing some of the numerical properties easier.)

Thank you for your answer it is very clarifying, however, why has the computer to pick $M=16,777,215$? It could pick a lower number but a larger exponent $e$, and then compute the result more precisely. — codingnight, Mar 09 '18 at 15:23
@codingnight: Suppose you have larger exponent, say 2**4 instead of 2**3. Then the number has to be a multiple of 16 instead of a multiple of 8. That makes it even more restrictive. Every number that is a multiple of 16 is a multiple of 8. For smaller spacing between values, you want a lower exponent. The only reason to increase the exponent is that the magnitude of the number requires it. — Eric Postpischil, Mar 10 '18 at 03:17

Trevor · Answer 2 · 2018-03-09T11:48:37.750

Floating point arithmetic is done in binary, not in decimal.

Floats actually have 24 binary bits of precision, 1 of which is a sign bit and 23 of which are called significand bits. This converts to approximately 7 decimal digits of precision.

The number you're looking at, 900180032, is already 9 digits long and so it makes sense that the last two digits (the 32) might be wrong. The rounding like the arithmetic is done in binary, the reason for the difference in rounding can only be seen if you break things down into binary.

900180032 = 110101101001111010100001000000

900180000 = 110101101001111010100000100000

If you count from the first 1 to the last 1 in each of those numbers (the part I put in bold), that is how many significand bits it takes to store the number. 900180032 takes only 23 significand bits to store while 900180000 takes 24 significand bits which makes 900180000 an impossible number to store as floats only have 23 significand bits. 900180032 is the closest number to the correct answer, 900180009, that a float can store.

In the other example

100040000 = 101111101100111110101000000

100040004 = 101111101100111110101000100

The correct answer, 100040004 has 25 significand bits, too much for floats. The nearest number that has 23 or less significand bits is 10004000 which only has 21 significant bits.

For more on floating point arithmetic works, try here http://steve.hollasch.net/cgindex/coding/ieeefloat.html

Floating-point arithmetic is not always done in binary. It is also done in decimal. — Eric Postpischil, Mar 09 '18 at 11:43

How are results rounded in floating-point arithmetic?

2 Answers2