
I was working on paper subtracting the floating-point numbers 1.0 - 1.0.

I aligned the exponents, then subtracted the mantissas, which gives zero, yet my result comes out to 1.0. Obviously the result is supposed to be 0. How is that possible? I can't normalize a mantissa of 0; there are no bits to shift.


  1. Exponents are aligned: 0x7F, 0x7F

  2. Subtracting both mantissas:

0x800000 - 0x800000 = 0

  3. Normalize results:

Mantissa is 0x0, and there are no bits in the mantissa to shift left or right to normalize.

  4. Result:

Exponent: 0x7F Mantissa: 0x0

1.0

user3869810
  • Please show the details of how you did that. Only then any potential mistake can be pointed out to you. – Yunnosch Jun 13 '23 at 20:38
  • I added more details to my problem. – user3869810 Jun 13 '23 at 20:48
  • Something is off there. I assume that you give `1.0` as result, because you have code ending on that result; because if the mantissa is 0 then the value is 0, by my understanding. Please show that code - or explain how from mantissa 0 you yourself come up with the final 1.0. – Yunnosch Jun 13 '23 at 20:51
  • I am working this on paper. If the mantissa is zero, that just means everything past the decimal point is zero, but the exponent is still 0x7F, which is where `1.0` comes from. I am under the assumption that if I normalized the mantissa, then I would have to decrement the exponent, but how can I do that if the mantissa is 0 and has no bits for me to shift? – user3869810 Jun 13 '23 at 20:54
  • You appear to be using something else to interpret “Exponent: 0x7F Mantissa: 0x0” as a floating-point number, and that something else is giving you 1. Maybe it is some software, maybe it is a book. You should update the question to be clear about this. – Eric Postpischil Jun 13 '23 at 20:58
  • I am using the bits interpretation like in this [article](https://digitalsystemdesign.in/floating-point-addition-and-subtraction/). – user3869810 Jun 13 '23 at 21:14
  • Normalizing results involves adjusting the exponent to correspond to shifting the significand. When the significand is zero, the normalization process requires an exception, since it is not possible to shift the significand to normalize it. You need an algorithm that includes the proper steps for normalization. The article you link to does not seem to have that. – Eric Postpischil Jun 13 '23 at 21:18
  • Can you link a resource that explains normalizing properly? – user3869810 Jun 13 '23 at 21:41
  • @user3869810 normalization is just bitshifting mantissa left/right and dec/inc exponent so mantissa starts with implicit 1 , denormals are such that the exponent is already at its minimum while mantissa still does not start with 1 ... – Spektre Jun 15 '23 at 11:21

1 Answer


It is awkward to work with floating-point numbers using the bits that represent them instead of using the mathematical form, ±F·bᵉ, where b is a fixed base (two for binary floating-point), e is an integer in a specified range, and F is a base-b numeral of fixed length and range. F is called the significand.¹

In the mathematical form, 1.0 is +1.000…000•2⁰, and subtracting +1.000…000•2⁰ from +1.000…000•2⁰ yields +0.000…000•2⁰ (the exponent does not matter), and, when we encode that into the IEEE-754 single-precision format (binary32), we get the bits 0000…0000 (sign bit 0, exponent field 00000000, significand field 0000…0000).
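This is easy to confirm mechanically; a minimal Python sketch using the standard `struct` module packs the result of `1.0 - 1.0` as a binary32 and inspects its bits:

```python
import struct

# Pack 1.0 - 1.0 as a big-endian IEEE-754 binary32, then reinterpret
# the same four bytes as an unsigned 32-bit integer to see the bits.
bits = struct.unpack(">I", struct.pack(">f", 1.0 - 1.0))[0]
print(f"{bits:032b}")  # all 32 bits are zero: sign 0, exponent field 0, significand field 0
```

Every field is zero, which is the encoding of +0.0.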

If you are going to work with the bits directly, you have to develop the algorithm in more detail. Once you have a result, you must encode it properly. The rules for encoding a binary32 number include:

  • If the significand begins with 1, the trailing bits (all bits after the 1) are stored in the significand field, and the exponent field is set to 127+e.
  • If the significand begins with 0, the trailing bits are stored in the significand field, and the exponent field is set to 0.

Thus, since you had a significand of 0, you should have set the exponent field to 0.
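Those two encoding rules can be sketched in a few lines of Python. The helper name `encode_binary32` is hypothetical, and the sketch deliberately ignores rounding, overflow, NaN, and infinity; it only shows how the exponent field is chosen from the leading significand bit:

```python
def encode_binary32(sign, significand, e):
    """Assemble binary32 bits from sign (0/1), a 24-bit significand
    with its leading bit explicit, and unbiased exponent e.
    Sketch only: no rounding, overflow, NaN, or infinity handling."""
    if significand & 0x800000:        # significand begins with 1: normal number
        exponent_field = e + 127      # biased exponent
    else:                             # significand begins with 0: zero or subnormal
        exponent_field = 0
    trailing = significand & 0x7FFFFF  # the 23 bits after the leading bit
    return (sign << 31) | (exponent_field << 23) | trailing

# The questioner's case: significand 0 after the subtraction
print(hex(encode_binary32(0, 0x000000, 0)))  # 0x0, i.e. +0.0
# For comparison, 1.0 itself:
print(hex(encode_binary32(0, 0x800000, 0)))  # 0x3f800000
```

With a zero significand the first rule cannot apply, so the second one forces the exponent field to 0 regardless of the exponent carried through the arithmetic.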

Footnote

¹ “Significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old term for the fraction portion of a logarithm. Significands are linear. Mantissas are logarithmic.

Eric Postpischil
  • Most articles I see [online](https://digitalsystemdesign.in/floating-point-addition-and-subtraction/) give steps using the bits representation. Can you expand your example to use the bits instead? – user3869810 Jun 13 '23 at 21:14
  • @user3869810: I would ask him not to since his answer is better the way it is. – President James K. Polk Jun 13 '23 at 22:16
  • @PresidentJamesK.Polk His answer is good but I would still want to see a method using the bits like I'm doing on paper. – user3869810 Jun 13 '23 at 23:41
  • yep, denormal and zero float representation is the answer +1, however your `significand field` name is confusing; I would call it `stored mantissa` or something like that, so it is not confused with the `sign bit` or MSB/LSB stuff... – Spektre Jun 14 '23 at 06:21
  • @user3869810: Giving a complete description of how to do floating-point subtraction via the bit representation is a large task. You have to check whether either operand is a NaN. If not, you have to check whether either operand is an infinity, and then you have to produce various results depending on which operand, which sign, and whether the other operand is an infinity. Then you have to decode the significands, accounting for subnormals… – Eric Postpischil Jun 14 '23 at 13:53
  • … Then you have to shift to align the operands, but should be limited for when the disparity between the exponents is so great it no longer matters, because you want to avoid shifting thousands of bits for no benefit. Then you subtract. Then you have to encode the result significand, now limiting the shift to account for denormals. That is a lot of algorithm to write up. It might be worth doing someday, but I am not going to take the time for it now. – Eric Postpischil Jun 14 '23 at 13:54
  • @EricPostpischil That's fine, I didn't get my exact answer but something is better than nothing. I'll upvote. – user3869810 Jun 14 '23 at 16:44
  • 1
    @user3869810 see [performing floating point addition algorithmically](https://stackoverflow.com/a/69925940/2521214) its part of what Eric had in mind and also look at the first link in that answer which contains the NaN/Inf detection and decomposition/reconstruction of float/double to/from compounds (sign,man,exp) – Spektre Jun 15 '23 at 06:58