First, let’s look at the parts of the encoded numbers, which I will label a (15032400896) and b (256.000152587890625):
a: 0 10100000 11000000000000000001111
b: 0 10000111 00000000000000000000101
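These bit patterns are easy to reproduce. Here is a quick Python sketch (just an illustration, not anything from the question; the struct round trip is one common way to get at the bits):

    import struct

    def fields32(x):
        """Print the IEEE 754 binary32 encoding of x as sign, exponent, fraction."""
        n = struct.unpack(">I", struct.pack(">f", x))[0]
        s = f"{n:032b}"
        print(s[0], s[1:9], s[9:])

    fields32(15032400896.0)        # 0 10100000 11000000000000000001111
    fields32(256.000152587890625)  # 0 10000111 00000000000000000000101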
Both sign bits are 0, indicating the numbers are positive. The exponent field of a is 10100000, which is 160. The encoded exponent is biased by 127, so the actual exponent is 160−127 = 33. (I assume the IEEE 754 basic 32-bit binary format is used.) The exponent field of b is 10000111, which is 135, so its actual exponent is 135−127 = 8.
These are in the normal range of floating-point, because the encoded exponents are not zero. (When the encoded exponent is zero, the number is subnormal.) In the normal range, there is an implicit “1.” prefixed to the significand. (The significand is the fraction portion of the number. It is sometimes called a “mantissa,” but that is a legacy term from the days of paper tables of logarithms; “significand” is the preferred term.)
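As an aside, here is a hypothetical value below the normal range, showing the zero exponent field that marks a subnormal:

    import struct

    # 2**-127 is below the smallest normal, 2**-126, so it encodes as a
    # subnormal: encoded exponent 00000000 and no implicit leading 1.
    n = struct.unpack(">I", struct.pack(">f", 2.0**-127))[0]
    s = f"{n:032b}"
    print(s[0], s[1:9], s[9:])   # 0 00000000 10000000000000000000000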
The significand field of the first number is 11000000000000000001111, so the actual significand is 1.11000000000000000001111 (as a binary numeral). The significand field of the second number is 00000000000000000000101, so its actual significand is 1.00000000000000000000101.
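Putting those pieces together, here is a small sketch that decodes the sign, unbiased exponent, and significand of a binary32 value in the normal range (the printed significands are the binary values above, shown in decimal):

    import struct

    def decode32(x):
        """Decode a binary32 value in the normal range."""
        n = struct.unpack(">I", struct.pack(">f", x))[0]
        sign = -1 if n >> 31 else 1
        exponent = ((n >> 23) & 0xFF) - 127          # remove the bias of 127
        significand = 1 + (n & 0x7FFFFF) * 2.0**-23  # attach the implicit "1."
        return sign, exponent, significand

    print(decode32(15032400896.0))        # (1, 33, 1.7500017881393433)
    print(decode32(256.000152587890625))  # (1, 8, 1.0000005960464478)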
Now we have fully decoded the numbers and can see their mathematical values are:
a = 1.11000000000000000001111 • 2^33
b = 1.00000000000000000000101 • 2^8
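We can verify these decoded forms exactly with rational arithmetic, for example:

    from fractions import Fraction

    sig_a = 1 + Fraction(int("11000000000000000001111", 2), 2**23)
    sig_b = 1 + Fraction(int("00000000000000000000101", 2), 2**23)
    assert sig_a * 2**33 == 15032400896
    assert sig_b * 2**8 == Fraction("256.000152587890625")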
The question is what happens when the sum of a and 24*b is calculated, so first we need to find 24*b. Since 24 is a simple number, I will skip showing its full floating-point representation and simply multiply b by 24. We can do this by multiplying its significand by 24, which produces:
24*b = 11000.0000000000000000111 1 • 2^8
I put a space between the first 24 bits and the remaining bit. That is because the floating-point format has only 24 bits in the significand, so the computer must round the exact mathematical result to fit in 24 bits. We could round down, to 11000.0000000000000000111, or up, to 11000.0000000000000001000. Since the exact value is equidistant between these, we have a tie. The most common rounding rule used in floating-point is to round to the nearest representable value and, in case of a tie, to round to the candidate with the even low digit. So we round up, and the result is:
24*b → 11000.0000000000000001000 • 2^8
Next, we want to normalize the representation so the significand starts with “1.” instead of “11000.” To do this, we adjust the exponent:
24*b → 1.10000000000000000001000 • 2^12
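A quick check confirms this rounded product. (In this sketch I compute 24*b in Python’s binary64, where it is exact, and then round once to binary32; that gives the same result a single binary32 multiplication would.)

    import struct

    def to_f32(x):
        """Round a Python float (binary64) to the nearest binary32 value."""
        return struct.unpack(">f", struct.pack(">f", x))[0]

    b = 256.000152587890625
    c = to_f32(24 * b)   # 24*b is exact in binary64, so this rounds just once
    print(c)             # 6144.00390625
    print(c - 24 * b)    # 0.000244140625 = 2**-12, from the tie rounding up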
I will call this result c. Now we want to add a and c, which are:
a = 1.11000000000000000001111 • 2^33
c = 1.10000000000000000001000 • 2^12
When the processor adds numbers, it effectively shifts the significands to align bits that represent the same magnitude. Aligning these numbers produces:
1.11000000000000000001111000000000000000000000 • 2^33
0.00000000000000000000110000000000000000001000 • 2^33
Then we can add the numbers, which yields:
1.11000000000000000010101000000000000000001000 • 2^33
Marking off the first 24 bits with a space shows:
1.11000000000000000010101 000000000000000001000 • 2^33
This time, the remaining bits are below the midpoint, so we round down, and the result is:
1.11000000000000000010101 • 2^33
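The same binary64-to-binary32 rounding trick reproduces the sum. (a + c is exact in binary64, since it spans only 42 significant bits, so the one rounding happens in the final conversion.)

    import struct

    def to_f32(x):
        return struct.unpack(">f", struct.pack(">f", x))[0]

    a = 15032400896.0
    c = 6144.00390625        # the rounded 24*b from above
    result = to_f32(a + c)   # a + c is exact in binary64; one rounding to binary32
    n = struct.unpack(">I", struct.pack(">f", result))[0]
    s = f"{n:032b}"
    print(result)               # 15032407040.0
    print(s[0], s[1:9], s[9:])  # 0 10100000 11000000000000000010101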
This is the final result of computing a + 24*b in 32-bit floating-point. Rounding has occurred, but I do not see how it can be described as “round down error in last 3 bits.” If the result had been computed with exact mathematics, it would be:
1.110000000000000000101010000000000000000001111000 • 2^33
So we can see the computed result is correct in all 24 of its bits, and the rounding error that has occurred lies far below the lowest retained bit.
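As a closing check, exact rational arithmetic shows both the error and how small it is relative to the result:

    from fractions import Fraction

    exact = Fraction(15032400896) + 24 * Fraction("256.000152587890625")
    computed = Fraction(15032407040)   # the binary32 result found above
    error = computed - exact
    print(float(error))          # -0.003662109375
    print(float(error / exact))  # about -2.4e-13 of the result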