First, let’s look at the parts of the encoded numbers, which I will label a (15032400896) and b (256.000152587890625):
a: 0 10100000 11000000000000000001111
b: 0 10000111 00000000000000000000101
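These bit patterns are easy to reproduce. Here is a quick Python sketch (just an illustration, not anything from the question; the struct round trip is one common way to get at the bits):

    import struct

    def fields32(x):
        """Print the IEEE 754 binary32 encoding of x as sign, exponent, fraction."""
        n = struct.unpack(">I", struct.pack(">f", x))[0]
        s = f"{n:032b}"
        print(s[0], s[1:9], s[9:])

    fields32(15032400896.0)        # 0 10100000 11000000000000000001111
    fields32(256.000152587890625)  # 0 10000111 00000000000000000000101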
Both sign bits are 0, indicating the numbers are positive. The exponent field of a is 10100000, which is 160. The encoded exponent is biased by 127, so the actual exponent is 160−127 = 33. (I assume the IEEE 754 basic 32-bit binary format is used.) The exponent field of b is 10000111, which is 135, so its actual exponent is 135−127 = 8.
These are in the normal range of floating-point, because the encoded exponents are not zero. (When the encoded exponent is zero, the number is subnormal.) In the normal range, there is an implicit “1.” prefixed to the significand. (The significand is the fraction portion of the number. It is sometimes called a “mantissa,” but that is a legacy term from the days of paper tables of logarithms; “significand” is the preferred term.)
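As an aside, here is a hypothetical value below the normal range, showing the zero exponent field that marks a subnormal:

    import struct

    # 2**-127 is below the smallest normal, 2**-126, so it encodes as a
    # subnormal: encoded exponent 00000000 and no implicit leading 1.
    n = struct.unpack(">I", struct.pack(">f", 2.0**-127))[0]
    s = f"{n:032b}"
    print(s[0], s[1:9], s[9:])   # 0 00000000 10000000000000000000000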
The significand field of the first number is 11000000000000000001111, so the actual significand is 1.11000000000000000001111 (as a binary numeral). The significand field of the second number is 00000000000000000000101, so its actual significand is 1.00000000000000000000101.
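Putting those pieces together, here is a small sketch that decodes the sign, unbiased exponent, and significand of a binary32 value in the normal range (the printed significands are the binary values above, shown in decimal):

    import struct

    def decode32(x):
        """Decode a binary32 value in the normal range."""
        n = struct.unpack(">I", struct.pack(">f", x))[0]
        sign = -1 if n >> 31 else 1
        exponent = ((n >> 23) & 0xFF) - 127          # remove the bias of 127
        significand = 1 + (n & 0x7FFFFF) * 2.0**-23  # attach the implicit "1."
        return sign, exponent, significand

    print(decode32(15032400896.0))        # (1, 33, 1.7500017881393433)
    print(decode32(256.000152587890625))  # (1, 8, 1.0000005960464478)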
Now we have fully decoded the numbers and can see their mathematical values are:
a = 1.11000000000000000001111 • 2^33
b = 1.00000000000000000000101 • 2^8
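We can verify these decoded forms exactly with rational arithmetic, for example:

    from fractions import Fraction

    sig_a = 1 + Fraction(int("11000000000000000001111", 2), 2**23)
    sig_b = 1 + Fraction(int("00000000000000000000101", 2), 2**23)
    assert sig_a * 2**33 == 15032400896
    assert sig_b * 2**8 == Fraction("256.000152587890625")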
The question is what happens when the sum of a and 24*b is calculated, so first we need to find 24*b. Since 24 is a simple number, I will skip showing its full floating-point representation and simply multiply b by 24. We can do this by multiplying its significand by 24, which produces:
24*b = 11000.0000000000000000111 1 • 2^8
I put a space between the first 24 bits and the remaining bit. That is because the floating-point format has only 24 bits in the significand, so the computer must round the exact mathematical result to fit in 24 bits. We could round down, to 11000.0000000000000000111, or up, to 11000.0000000000000001000. Since the exact value is equidistant between these, we have a tie. The most common rounding rule used in floating-point is to round to the nearest representable value and, in case of a tie, to round to the candidate with the even low digit. So we round up, and the result is:
24*b → 11000.0000000000000001000 • 2^8
Next, we want to normalize the representation so the significand starts with “1.” instead of “11000.” To do this, we adjust the exponent:
24*b → 1.10000000000000000001000 • 2^12
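A quick check confirms this rounded product. (In this sketch I compute 24*b in Python’s binary64, where it is exact, and then round once to binary32; that gives the same result a single binary32 multiplication would.)

    import struct

    def to_f32(x):
        """Round a Python float (binary64) to the nearest binary32 value."""
        return struct.unpack(">f", struct.pack(">f", x))[0]

    b = 256.000152587890625
    c = to_f32(24 * b)   # 24*b is exact in binary64, so this rounds just once
    print(c)             # 6144.00390625
    print(c - 24 * b)    # 0.000244140625 = 2**-12, from the tie rounding up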
I will call this result c. Now we want to add a and c, which are:
a = 1.11000000000000000001111 • 2^33
c = 1.10000000000000000001000 • 2^12
When the processor adds numbers, it effectively shifts the significands to align bits that represent the same magnitude. Aligning these numbers produces:
1.11000000000000000001111000000000000000000000 • 2^33
0.00000000000000000000110000000000000000001000 • 2^33
Then we can add the numbers, which yields:
1.11000000000000000010101000000000000000001000 • 2^33
Marking off the first 24 bits with a space shows:
1.11000000000000000010101 000000000000000001000 • 2^33
This time, the remaining bits are below the midpoint, so we round down, and the result is:
1.11000000000000000010101 • 2^33
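The same binary64-to-binary32 rounding trick reproduces the sum. (a + c is exact in binary64, since it spans only 42 significant bits, so the one rounding happens in the final conversion.)

    import struct

    def to_f32(x):
        return struct.unpack(">f", struct.pack(">f", x))[0]

    a = 15032400896.0
    c = 6144.00390625        # the rounded 24*b from above
    result = to_f32(a + c)   # a + c is exact in binary64; one rounding to binary32
    n = struct.unpack(">I", struct.pack(">f", result))[0]
    s = f"{n:032b}"
    print(result)               # 15032407040.0
    print(s[0], s[1:9], s[9:])  # 0 10100000 11000000000000000010101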
This is the final result of computing a + 24*b in 32-bit floating-point. Rounding has occurred, but I do not see how it can be described as “round down error in last 3 bits.” If the result had been computed with exact mathematics, it would be:
1.110000000000000000101010000000000000000001111000 • 2^33
So we can see the computed result is correct in all 24 of its bits, and the rounding error that has occurred lies far below the lowest retained bit.
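As a closing check, exact rational arithmetic shows both the error and how small it is relative to the result:

    from fractions import Fraction

    exact = Fraction(15032400896) + 24 * Fraction("256.000152587890625")
    computed = Fraction(15032407040)   # the binary32 result found above
    error = computed - exact
    print(float(error))          # -0.003662109375
    print(float(error / exact))  # about -2.4e-13 of the result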