Floating point addition / subtraction

Question

I get confused because of the hidden bit in the mantissa.

From what I know:

Subtract the two exponents, find the smaller number, and shift its mantissa, with the hidden bit (?), right by the result of the subtraction.
Calculate the result sign.
Add or subtract the two mantissas, with the hidden bit (?), based on the sign bits of the two operands (1,1 or 0,0: ADD) (1,0 or 0,1: SUB). The result will be 25 bits because it must accommodate the carry (?).
If we performed an addition:
- If bit 24 (counting from 0) of the result is 1 (there is a carry), shift the mantissa right by 1 and increment the exponent.
- Else do nothing.
If we performed a subtraction:
- Count the leading zeros (considering also the hidden bit (?)) and shift the mantissa left by that number; also subtract that number from the exponent.
Result.
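The steps above can be sketched in software as follows. This is only a minimal illustration, not a reference design: it assumes both inputs are normal binary32 numbers (no zero, subnormal, infinity or NaN inputs), it truncates the bits shifted out during alignment instead of rounding, and it ignores exponent overflow and underflow. The names `add_f32`, `float_to_bits`, `bits_to_float` and `add` are made up for this example. Note that the hidden bit is OR-ed in before any shifting or adding, so it takes part in every step marked with (?).

```python
import struct

def float_to_bits(x):
    """Return the binary32 encoding of x as a 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(b):
    """Return the float whose binary32 encoding is b."""
    return struct.unpack(">f", struct.pack(">I", b))[0]

def add_f32(a_bits, b_bits):
    """Add two normal binary32 encodings (truncating, no special cases)."""
    # Unpack the fields and restore the hidden bit (bit 23) of each mantissa.
    sa, ea, ma = a_bits >> 31, (a_bits >> 23) & 0xFF, (a_bits & 0x7FFFFF) | 0x800000
    sb, eb, mb = b_bits >> 31, (b_bits >> 23) & 0xFF, (b_bits & 0x7FFFFF) | 0x800000

    # Make (sa, ea, ma) the operand with the larger magnitude.
    if (ea, ma) < (eb, mb):
        sa, ea, ma, sb, eb, mb = sb, eb, mb, sa, ea, ma

    # Align: shift the smaller operand right by the exponent difference.
    mb >>= ea - eb

    # The result takes the sign and exponent of the larger operand.
    sign, exp = sa, ea

    if sa == sb:
        m = ma + mb                 # may need 25 bits
        if m >> 24:                 # carry out into bit 24
            m >>= 1
            exp += 1
    else:
        m = ma - mb                 # never negative, since ma >= mb
        if m == 0:
            return 0                # exact cancellation gives +0.0
        shift = 24 - m.bit_length() # leading zeros within the 24-bit field
        m <<= shift
        exp -= shift

    # Drop the hidden bit again when packing the result.
    return (sign << 31) | (exp << 23) | (m & 0x7FFFFF)

def add(x, y):
    """Convenience wrapper working on Python floats."""
    return bits_to_float(add_f32(float_to_bits(x), float_to_bits(y)))

print(add(1.5, 2.25))   # 3.75
print(add(5.0, -4.75))  # 0.25
```

In hardware the comparison and swap would typically become an exponent comparator plus multiplexers, and the leading-zero count a priority encoder, but the data flow is the same.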

Is that right?

(a) There is no question in your post. Ask a specific question. — Eric Postpischil, Sep 03 '21 at 23:37
(b) When working with floating-point **values**, do not think about hidden bits. The format of a floating-point representation is ±f•b^e, where b is the fixed base, + or − is the sign, f is the significand, and e is the exponent. Do arithmetic with this representation. The “hidden” bit, the biased exponent, and the sign bit are part of the **encoding** of the floating-point number. They are just techniques we use to represent the number in memory or a register. They are not the actual value. Do arithmetic with the value. — Eric Postpischil, Sep 03 '21 at 23:40
If you have an encoding, decode it into the value representation first, then do arithmetic. Then, if desired, encode the resulting value into an encoding. — Eric Postpischil, Sep 03 '21 at 23:40
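As a rough illustration of the decode, compute, encode idea in the comments above, the sketch below turns a binary32 encoding into a (sign, f, e) triple meaning ±f·2^e and back again. Keeping f as an integer of at most 24 bits is a choice made for this example, as are the names `decode`, `encode` and `value`. Infinities and NaNs are rejected rather than handled, and `encode` only accepts triples in the form `decode` produces (it does no rounding or renormalizing).

```python
import math
import struct

def float_to_bits(x):
    """Return the binary32 encoding of x as a 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def decode(bits):
    """Encoding -> (sign, f, e) with value = (-1)**sign * f * 2**e."""
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if biased == 0xFF:
        raise ValueError("infinity or NaN is not handled in this sketch")
    if biased == 0:
        # Zero or subnormal: there is no hidden 1 in this case.
        return sign, frac, -149
    # Normal number: restore the hidden bit, remove the bias (127) and
    # account for the 23 fraction bits so that f is an integer.
    return sign, frac | 0x800000, biased - 127 - 23

def encode(sign, f, e):
    """Inverse of decode, for triples in the form decode returns."""
    if f < 0x800000:
        # Zero or subnormal: stored with a biased exponent of 0.
        return (sign << 31) | f
    # Normal number: add the bias back and drop the hidden bit.
    return (sign << 31) | ((e + 127 + 23) << 23) | (f & 0x7FFFFF)

def value(sign, f, e):
    """The real number that a decoded triple stands for."""
    return (-1) ** sign * math.ldexp(f, e)

s, f, e = decode(float_to_bits(-6.5))
print(s, hex(f), e)    # 1 0xd00000 -21
print(value(s, f, e))  # -6.5
```

Once both operands are in this form, the arithmetic itself never has to mention the hidden bit or the bias; those only reappear in `encode`.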
Any nonzero normalized floating-point number in binary starts with `1`. What the hidden bit is all about is that, in the actual encoding of the mantissa, the first bit is not stored (so there is one more bit available for the mantissa to improve precision). So once you decode the number for low-level computations, you need to add the non-stored leading `1` back to the mantissa. As Eric mentioned, you do not need to bother with this unless you are implementing FPU stuff yourself ... see [my 32bit float print using only integers](https://stackoverflow.com/a/59861545/2521214) for some ideas and confusion :) — Spektre, Sep 04 '21 at 07:48
@Spektre Yes, I'm actually implementing a floating-point adder in hardware. From what I understood, I decode the number by adding the hidden bit (now the number is 33 bits long) and then do all the math with this representation, right? — Gabbed, Sep 04 '21 at 09:35
@Gabbed Yes ... however, during the computation the mantissa is usually stored in more bits than just 24 in order to have smaller rounding errors, especially for `+,-` operations, as you need to align both operands to the same exponent. After the operation is done, the result is truncated back to a 24-bit mantissa. — Spektre, Sep 04 '21 at 13:17
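To make the point about extra bits concrete, here is a small sketch of a same-sign addition that keeps three extra low-order bits (commonly called guard, round and sticky) during alignment. It is an assumption of this example, not something stated in the thread, that the final step rounds to nearest with ties to even (the IEEE 754 default) rather than truncating. The name `add_aligned` and the constant `EXTRA_BITS` are invented here; `ma` and `mb` are 24-bit mantissas with the hidden bit already restored, and `shift` is the exponent difference.

```python
EXTRA_BITS = 3  # guard, round and sticky bits kept below the 24-bit mantissa

def add_aligned(ma, mb, shift):
    """Add mantissas of two same-sign operands; mb's exponent is `shift` lower.

    Returns (mantissa, exponent_increment), rounded to nearest, ties to even.
    """
    a = ma << EXTRA_BITS
    b = mb << EXTRA_BITS

    # Align b. Any 1 bits shifted off the end are remembered in the
    # lowest (sticky) bit instead of being silently discarded.
    lost = b & ((1 << shift) - 1)
    b = (b >> shift) | (1 if lost else 0)

    total = a + b
    exp_inc = 0
    if total >> (24 + EXTRA_BITS):       # carry out of the top bit
        total = (total >> 1) | (total & 1)
        exp_inc = 1

    kept = total >> EXTRA_BITS           # the 24 bits that will be stored
    rem = total & ((1 << EXTRA_BITS) - 1)
    half = 1 << (EXTRA_BITS - 1)
    if rem > half or (rem == half and kept & 1):
        kept += 1
        if kept >> 24:                   # rounding overflowed the mantissa
            kept >>= 1
            exp_inc += 1
    return kept, exp_inc

# 1.0 + 1.5 * 2**-24: the small operand is shifted entirely below bit 0.
print(add_aligned(0x800000, 0xC00000, 24))  # (8388609, 0), i.e. 0x800001
print(0x800000 + (0xC00000 >> 24))          # 8388608: plain 24-bit alignment
```

With only 24 bits the small operand vanishes completely, while the extra bits preserve enough information to round the result up by one unit in the last place.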
