performing floating point addition algorithmically

Question

I'm trying to understand the algorithm for floating-point addition. In the past I've only had to do this on paper, by converting to decimal and back again. I am writing a floating-point ALU in an HDL, so that won't work in this case. I've read many of the questions on the topic (the most useful of which I've used for this example) and many articles, but some concepts still elude me. I've written the questions in context below, but here they are in summary:

When is the implicit bit in the mantissa 0, and when is it 1?
After the addition, how do we algorithmically check for normalization, and then determine which way to shift?
If one of the numbers is negative, is the subtraction of the mantissas performed in two's complement or not?

Borrowing from this example:

00001000111100110110010010011100 (1.46487e-33)
00000000000011000111111010000100 (1.14741e-39)

First split them into their components (sign, exp, mantissa)

0 00010001 11100110110010010011100
0 00000000 00011000111111010000100

Next tack on their implicit integer value

0 00010001 1.11100110110010010011100
0 00000000 0.00011000111111010000100

Question 1: Is the reason for the zero integer in front of the 2nd value that the exponent is zero?

Next subtract the lesser exponent from the greater and shift the mantissa of the value with the lesser exponent right by that amount

  00010001
- 00000000
___________
00010001 = 17

0.00000000000000000000110

Add the mantissas

   0.00000000000000000000110
+  1.11100110110010010011100
______________________________
   1.11100110110010010100010

Question 2: In this case the MSB is 1 so the value is normalized and we can drop it. Suppose that it weren't. If the MSB was 0 would that still be considered a normalized value or would we shift left to get a 1 in that place?

Question 3: Suppose one of the numbers was negative. Is the subtraction performed in two's complement, or is it enough to simply subtract the mantissas as they are?

Note: "If one of the numbers is negative" includes -0.0, depending on the precise usage of "is negative". +0.0, -0.0, +inf, -inf and NAN all involve special handling. Do not forget about them. — chux - Reinstate Monica, Nov 11 '21 at 06:26

chux - Reinstate Monica · Accepted Answer · 2021-11-11T06:34:30.810

When is the implicit bit in the mantissa 0, and when is it 1?

When the (biased) exponent is at the minimum value (e.g. 0), the implicit bit is 0.
When the (biased) exponent is at the maximum value, there is no implicit bit. The value is infinity or NAN.
Otherwise the implicit bit is 1.
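
The three rules above can be sketched in C++ as follows. This is only an illustration for IEEE 754 binary32, not the answerer's code: the `Unpacked` type, the `unpack` function and the `eff_exp` field are hypothetical names introduced here. `eff_exp` reflects the usual convention that a subnormal value is scaled like a value with biased exponent 1, which matters when the exponents are later subtracted for alignment.

```cpp
#include <cstdint>

// Hypothetical helper type: the unpacked fields of an IEEE 754 binary32 value.
struct Unpacked {
    bool     sign;        // true if the sign bit is set
    int      biased_exp;  // raw 8-bit exponent field, 0..255
    int      eff_exp;     // biased exponent to use for alignment (1 for subnormals)
    uint32_t significand; // fraction bits plus the implicit bit (bit 23) when it is 1
    bool     special;     // true for infinity / NaN (exponent field all ones)
};

Unpacked unpack(uint32_t bits) {
    Unpacked u{};
    u.sign       = (bits >> 31) != 0;
    u.biased_exp = static_cast<int>((bits >> 23) & 0xFF);
    const uint32_t fraction = bits & 0x7FFFFF;

    if (u.biased_exp == 0xFF) {        // maximum exponent: infinity or NaN, no implicit bit
        u.special     = true;
        u.eff_exp     = u.biased_exp;
        u.significand = fraction;      // zero means infinity, nonzero means NaN
    } else if (u.biased_exp == 0) {    // minimum exponent: zero or subnormal, implicit bit 0
        u.eff_exp     = 1;             // same scale as biased exponent 1
        u.significand = fraction;
    } else {                           // normal number: implicit bit 1
        u.eff_exp     = u.biased_exp;
        u.significand = fraction | 0x800000;
    }
    return u;
}
```

Applied to the two operands from the question, `unpack(0x08F3649C)` yields biased exponent 17 with the implicit bit set, and `unpack(0x000C7E84)` yields exponent field 0 with the implicit bit clear.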

After the addition, how do we algorithmically check for normalization, and then determine which way to shift?

With addition (of 2 operands with the same sign), if there is a carry out of the most-significant place of the sum, shift right, increment the exponent. Check for exponent overflow.

With addition (of 2 operands with opposite signs), which is effectively subtraction: if all significand bits are zero, return zero. Otherwise, if the most-significant place is zero, repeatedly shift left as needed, decrementing the exponent each time, but do not decrement the exponent below the minimum.
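
A minimal sketch of these two normalization cases, assuming binary32 with a 24-bit significand whose integer bit sits at bit 23. The names `Norm`, `normalize` and `kIntegerBit` are made up for this illustration, and the bit shifted out on a carry is simply truncated here; a real ALU would round it.

```cpp
#include <cstdint>

// Hypothetical result type: a 24-bit significand and the exponent field to store.
struct Norm {
    uint32_t significand; // bit 23 is the integer (formerly implicit) bit
    int      exp_field;   // 0 for zero/subnormal, 1..254 normal, 255 flags overflow
};

constexpr uint32_t kIntegerBit = 1u << 23;

// Normalize the magnitude obtained by adding or subtracting two aligned
// significands. `exp` is the shared effective biased exponent (at least 1).
// `sum` is assumed to fit in 25 bits (one carry bit above the integer bit).
Norm normalize(uint32_t sum, int exp) {
    if (sum == 0) return {0, 0};               // exact zero: nothing to normalize

    if (sum & (kIntegerBit << 1)) {            // carry out of the most-significant place
        sum >>= 1;                             // shift right (lost bit truncated here)
        ++exp;
        if (exp >= 255) return {0, 255};       // exponent overflow: caller returns infinity
    }
    while (!(sum & kIntegerBit) && exp > 1) {  // leading zero: shift left,
        sum <<= 1;                             // but never go below the minimum exponent
        --exp;
    }
    // If the integer bit is still clear, the result is subnormal (exponent field 0).
    return {sum, (sum & kIntegerBit) ? exp : 0};
}
```

For the sum in the question, `normalize(0xF3649C + 0x6, 17)` leaves the value unchanged because the integer bit is already set and there is no carry.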

If one of the numbers is negative, is the subtraction of the mantissas performed in two's complement or not?

No. Common FP encoding is sign-magnitude.

Question 1: Is the reason for the zero integer in front of the 2nd value that the exponent is zero?

Yes, the biased exponent is at minimum.

Question 2: In this case the MSB is 1 so the value is normalized and we can drop it. Suppose that it weren't. If the MSB was 0 would that still be considered a normalized value or would we shift left to get a 1 in that place?

The MSBit might be zero when the signs differ (or both operands are 0.0). If the sum is not zero, shift left as described above.

Question 3: Suppose one of the numbers was negative. Is the subtraction performed in two's complement, or is it enough to simply subtract the mantissas as they are?

Two's complement is not used. When the signs are the same, add the magnitudes. When the signs differ, flip the 2nd operand's sign bit and call your subtraction code.
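
As a rough illustration of the sign-magnitude approach, the sketch below adds two significands that have already been aligned to the same exponent. `SignMag` and `add_sign_magnitude` are hypothetical names, and the choice of +0 for exact cancellation assumes the default round-to-nearest mode.

```cpp
#include <cstdint>

// Hypothetical type: a sign flag plus an unsigned, already aligned magnitude.
struct SignMag {
    bool     negative;
    uint32_t magnitude;
};

// Add two sign-magnitude values without converting to two's complement.
SignMag add_sign_magnitude(SignMag a, SignMag b) {
    if (a.negative == b.negative) {
        // Same sign: add the magnitudes and keep the common sign.
        return {a.negative, a.magnitude + b.magnitude};
    }
    if (a.magnitude == b.magnitude) {
        // Exact cancellation: +0 in the default rounding mode.
        return {false, 0};
    }
    // Different signs: subtract the smaller magnitude from the larger one;
    // the result takes the sign of the operand with the larger magnitude.
    if (a.magnitude > b.magnitude) {
        return {a.negative, a.magnitude - b.magnitude};
    }
    return {b.negative, b.magnitude - a.magnitude};
}
```

Because the larger magnitude is always the minuend, the subtraction never goes negative, so an unsigned subtractor is enough.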

Note that IEEE 754 uses the term "significand" rather than "mantissa".

Spektre · Answer 2 · 2021-11-11T09:58:46.140

Answer1

See the small C++/VCL example of dissecting 32- and 64-bit floats for how to deal with the normalized/denormalized and zero/inf/NaN states of floats. The state is defined by the combination of the exponent and mantissa values.

Answer2

No, you do not shift so that a 1 lands in the first place before the binary point. Instead, you shift by the difference between the operands' exponents. Also, the ALU operations on mantissas are usually done at a wider bit width to reduce rounding errors; only the result is truncated to the original mantissa width after normalization.
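
One possible way to do the alignment at a wider bit width is sketched below. The three guard bits, the sticky flag, and the names `Aligned`, `align` and `kGuardBits` are assumptions made for this example, not something the answer prescribes.

```cpp
#include <cstdint>

constexpr int kGuardBits = 3; // assumption: three extra low-order bits

// Hypothetical result type: the widened, shifted significand plus a sticky
// flag that records whether any nonzero bit was shifted out.
struct Aligned {
    uint32_t wide;
    bool     sticky;
};

// Shift a 24-bit significand right by `shift` places (shift >= 0), keeping
// kGuardBits extra bits of precision below the original least-significant bit.
Aligned align(uint32_t significand, int shift) {
    const uint32_t wide = significand << kGuardBits; // at most 27 bits used
    if (shift >= 32) {
        return {0, wide != 0};                       // everything is shifted out
    }
    const uint32_t kept = wide >> shift;
    const uint32_t lost = wide & ((1u << shift) - 1u); // bits that fell off the end
    return {kept, lost != 0};
}
```

For the question's example, if the subnormal operand is treated as having biased exponent 1 (the usual IEEE 754 convention), the exponent difference is 16 rather than 17, and `align(0x0C7E84, 16)` keeps `0x63` with the sticky flag set.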

Answer3

Yes, you can also use two's complement, so for c=a+b you can do it, for example, like this in C++ (using the already dissected parts of the operands):

// dissect a,b into their components and add the implicit 1 to the mantissas if needed
// a = (-1)^a.sig * a.man * 2^a.exp;
// b = (-1)^b.sig * b.man * 2^b.exp;
// a.man, b.man, c.man are assumed to be signed integers wider than the mantissa

// here you should handle special cases when operands are (+/-)inf or nan

if (a.sig) a.man=-a.man; // convert mantissas to two's complement
if (b.sig) b.man=-b.man;

sh=a.exp-b.exp; // exponent difference (assumed smaller than the bit width of man)
if (sh>=0) // shift the operand with the smaller exponent right to align it
   {
   b.man>>=sh; // arithmetic shift (guaranteed since C++20) keeps the sign
   c.exp=a.exp;
   }
else
   {
   a.man>>=-sh; // shift count must not be negative, so negate it
   c.exp=b.exp;
   }
c.man=a.man+b.man; // two's complement addition
c.sig=0;
if (c.man<0){ c.sig=1; c.man=-c.man; } // convert back to unsigned mantissa

// here you should normalize c.exp,c.man and remove the implicit 1 from the mantissa
// and reconstruct the result float
// c = (-1)^c.sig * c.man * 2^c.exp;

You can also do this on an unsigned ALU; however, you then need to sort the operands by sign and absolute value, which is much more work...
