I am trying to perform the subtraction with the binary representation of fp32.
I have seen this questions in order to perform the subtraction:
- How to subtract IEEE 754 numbers?
- Implementing floating point subtraction
- https://www.cs.uaf.edu/2000/fall/cs301/notes/notes/node51.html
I understand that subtraction in IEE754 is similar to addition until the step to adding/subtract the mantissas. The steps to the algorithm addition/subtraction if I understand correctly are:
- Get the biggest number.
- Subtract the biggest exponent with the smallest and take the biggest exponent to the result.
- Shift the mantissa from the smallest operand to the right until the exponents will be aligned.
- Now, if the signs of the operands are equal (+,+ or -,-), then add the mantissas. If the signs are distinct (+,- or -,+), then subtract the mantissa of the biggest operand (in magnitude) with the shifted mantissa (I think that I have the mistaken here...)
- When addition: if the addition of mantissas overflow, add one to the exponent and shift right the resultant mantissa. When subtraction: I don't understand that I need to do here
My doubt is: If the input is, for example: 100.3654, -100.254, the resultant exponent must be smaller than the exponents of the inputs, because 100.3654-100.254 = 0.1114 and the exponent of 0.1114 is 123, but the exponents of 100.3654 and -100.254 are 133, so, when I apply the step to subtract the exponents to get the shift, the resultant shift is zero... I have achieved perform correctly subtraction in IEEE754 but only when the operands are not closer. For example if the first operand is 100 and the second operand is -1.95. In this case, the resultant exponent is 133 too.
This is the code that I am using to check the algorithm (Python). I have also implemented it in Verilog and works well if the operands has the same sign. No rounding mode is set because I don't need it in my case.
def ieee754_addition(a, b):
# Get binary representation for input a
a_bin = ieee754_bin(a)
a_exp = int(a_bin['exp'], 2)
a_mnts = int("1"+a_bin['mnts'], 2)
# Get binary representation for input b
b_bin = ieee754_bin(b)
b_exp = int(b_bin['exp'], 2)
b_mnts = int("1"+b_bin['mnts'], 2)
# We suppose that operand a is greater than operand b
exp = a_exp
mnts = a_mnts
exp_m = b_exp
mnts_m = b_mnts
# If operand b is greater than operand b
if b_exp >= a_exp:
mnts = b_mnts
exp = b_exp
exp_m = a_exp
mnts_m = a_mnts
# How many shifts are needed to normalize the smaller mantissa
shift = int(exp - exp_m)
# Shift the mantissa
mnts_m_shift = mnts_m >> shift
# If the signs are distincts, perform mantissa's subtraction
if ((a_bin['sign'] == '1' and b_bin['sign'] == '0') or (a_bin['sign'] == '0' and b_bin['sign'] == '1')):
if mnts > mnts_m_shift:
mnts = (mnts - mnts_m_shift)
else:
mnts = (mnts_m_shift - mnts)
# If signs are equals
else:
# Adding the mantissas
mnts = (mnts + mnts_m_shift)
# Get the sign of the greater operand
if (abs(a) > abs(b)) :
sign = a_bin['sign']
else:
sign = b_bin['sign']
# If signs are equal
if (a_bin['sign'] == b_bin['sign']):
sign = a_bin['sign']
msign = 1
if sign == '1':
msign = -1
# If overflow when the mantissas has been added
nrm_bit = int("{:025b}".format(mnts)[0],2)
# Shift left the mantissa
mnts_norm = mnts >> nrm_bit
mnts_bin = "{:024b}".format(mnts_norm)
mnts_ = mnts_bin[1:24]
# Adding one to the exponent if mantissa result overflow
exp_bin = "{:08b}".format(exp + nrm_bit)
# Concatenate exponent and mantissa result
result = "0" + exp_bin + mnts_
# Return the result in fp32 format
return msign*bin_fp32(result)
# Seed to set always the same random values
np.random.seed(5)
# Random values
samples = 65536
a = np.random.random(samples) * 100
b = np.random.random(samples) * 100
# Performing the test
errors = np.zeros(len(a), dtype=np.float32)
results_teor = np.zeros(len(a), dtype=np.float32)
results_prct = np.zeros(len(a), dtype=np.float32)
for i in range(len(a)):
result = ieee754_addition(a[i], b[i])
teor = (a[i] + b[i]).astype(np.float32)
errors[i] = abs(result - teor)
results_teor[i] = teor
results_prct[i] = result
print(np.max(errors))