
I am trying to perform subtraction with the binary representation of fp32 numbers.

I have looked at this question to learn how to perform the subtraction:

I understand that subtraction in IEEE 754 is similar to addition up to the step where the mantissas are added/subtracted. If I understand correctly, the steps of the addition/subtraction algorithm are the following (a small sketch of steps 2-4 follows the list):

  1. Get the bigger of the two operands (in magnitude).
  2. Subtract the smaller exponent from the bigger one and take the bigger exponent as the exponent of the result.
  3. Shift the mantissa of the smaller operand to the right until the exponents are aligned.
  4. Now, if the signs of the operands are equal (+,+ or -,-), add the mantissas. If the signs are different (+,- or -,+), subtract the shifted mantissa from the mantissa of the bigger operand (in magnitude) (I think my mistake is here...)
  5. For addition: if the addition of the mantissas overflows, add one to the exponent and shift the resulting mantissa right. For subtraction: I don't understand what I need to do here.
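
To make sure the steps above mean what I think they mean, here is a minimal, self-contained sketch of steps 2-4 working directly on the raw bit fields (the function names are just for this sketch, it uses struct instead of my ieee754_bin helper, it assumes normal numbers, and it leaves out sign selection and step 5):

import struct

def fp32_fields(x):
    # Raw fp32 fields of a normal number: sign bit, biased exponent,
    # 24-bit significand with the implicit leading 1 restored.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | 0x800000

def align_and_combine(a, b):
    sa, ea, ma = fp32_fields(a)
    sb, eb, mb = fp32_fields(b)
    # Step 1: keep the operand with the larger magnitude as the "big" one
    if (eb, mb) > (ea, ma):
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    # Steps 2-3: the shift is the exponent difference; the result keeps the big exponent
    shift = ea - eb
    mb >>= shift
    # Step 4: add when the signs are equal, subtract otherwise
    m = ma + mb if sa == sb else ma - mb
    # Step 5 (normalization/re-normalization) would go here
    return sa, ea, m

print(align_and_combine(100.3654, -100.254))  # exponent is still 133, but most significand bits cancel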

My doubt is: if the inputs are, for example, 100.3654 and -100.254, the exponent of the result must be smaller than the exponents of the inputs, because 100.3654 - 100.254 = 0.1114 and the biased exponent of 0.1114 is 123, while the biased exponents of 100.3654 and -100.254 are both 133. So when I apply the step that subtracts the exponents to obtain the shift, the resulting shift is zero... I have managed to perform subtraction in IEEE 754 correctly, but only when the operands are not close to each other, for example when the first operand is 100 and the second operand is -1.95; in that case the exponent of the result is 133 too.
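
A quick check of these exponents (done with struct just to illustrate the problem; it is not part of my implementation):

import struct

def biased_exp(x):
    # Biased exponent field of the fp32 representation of x
    return (struct.unpack('>I', struct.pack('>f', x))[0] >> 23) & 0xFF

print(biased_exp(100.3654), biased_exp(-100.254), biased_exp(100.3654 - 100.254))
# prints: 133 133 123 -> the exponent of the result is 10 below both input exponents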

This is the code that I am using to check the algorithm (Python). I have also implemented it in Verilog and it works well when the operands have the same sign. No rounding mode is implemented because I don't need it in my case.

import numpy as np

# ieee754_bin() and bin_fp32() are helpers of mine (not shown here): ieee754_bin()
# returns a dict with the 'sign', 'exp' and 'mnts' bit strings of a float, and
# bin_fp32() converts a 32-bit string back to a float.
def ieee754_addition(a, b):

    # Get binary representation for input a
    a_bin  = ieee754_bin(a)
    a_exp  = int(a_bin['exp'], 2)
    a_mnts = int("1"+a_bin['mnts'], 2)
    
    # Get binary representation for input b
    b_bin  = ieee754_bin(b)
    b_exp  = int(b_bin['exp'], 2)
    b_mnts = int("1"+b_bin['mnts'], 2)
    
    # Assume operand a has the larger exponent
    exp = a_exp
    mnts = a_mnts
    exp_m = b_exp
    mnts_m = b_mnts
    
    # If operand b has the larger (or equal) exponent, swap the roles
    if b_exp >= a_exp:
        mnts = b_mnts
        exp = b_exp
        exp_m = a_exp
        mnts_m = a_mnts
    
    # How many shifts are needed to align the smaller mantissa
    shift = int(exp - exp_m)
    
    # Shift the mantissa
    mnts_m_shift = mnts_m >> shift


    # If the signs are different, subtract the mantissas
    if ((a_bin['sign'] == '1' and b_bin['sign'] == '0') or (a_bin['sign'] == '0' and b_bin['sign'] == '1')):
        if mnts > mnts_m_shift:
            mnts = (mnts - mnts_m_shift)
        else:
            mnts = (mnts_m_shift - mnts)
    # If the signs are equal
    else:
        # Adding the mantissas
        mnts = (mnts + mnts_m_shift)
    
    # Get the sign of the greater operand
    if (abs(a) > abs(b)) :
        sign = a_bin['sign']
    else:
        sign = b_bin['sign']
    
    # If signs are equal
    if (a_bin['sign'] == b_bin['sign']):
        sign = a_bin['sign']
    
    msign = 1
    if sign == '1':
        msign = -1
    
    # Carry-out bit: 1 if the mantissa addition overflowed into bit 24
    nrm_bit = int("{:025b}".format(mnts)[0],2)
    
    # Shift the mantissa right by one if it overflowed
    mnts_norm = mnts >> nrm_bit
    mnts_bin = "{:024b}".format(mnts_norm)
    mnts_ = mnts_bin[1:24]
    
    # Add one to the exponent if the mantissa addition overflowed
    exp_bin = "{:08b}".format(exp + nrm_bit)
    
    # Concatenate exponent and mantissa result
    result = "0" + exp_bin + mnts_
    
    # Return the result in fp32 format
    return msign*bin_fp32(result)

# Seed to always get the same random values
np.random.seed(5)

# Random values
samples = 65536
a = np.random.random(samples) * 100
b = np.random.random(samples) * 100

# Performing the test
errors = np.zeros(len(a), dtype=np.float32)
results_teor = np.zeros(len(a), dtype=np.float32)
results_prct = np.zeros(len(a), dtype=np.float32)
for i in range(len(a)):
    result  = ieee754_addition(a[i], b[i])
    teor = (a[i] + b[i]).astype(np.float32)
    errors[i] = abs(result - teor)
    results_teor[i] = teor
    results_prct[i] = result

print(np.max(errors))
  • Re "When subtraction: I don't understand that I need to do here": Leading bits may cancel when smaller significand is subtracted from larger significand. If so, the significand of the result must be re-normalized by a left shift. – njuffa Mar 29 '21 at 08:27
  • Hi @njuffa, I am going to check it. Thanks! One question: when this condition occurs, the exponent must also be adjusted, right? In my case I am seeing that my result comes out multiplied by 2, 4..., which is the same as needing to subtract 1, 2... from the exponent to divide it by powers of two. – Diego Ruiz Mar 29 '21 at 08:52
  • Correct, whenever significand is shifted right, increase exponent; whenever significand is shifted left, decrease exponent. – njuffa Mar 29 '21 at 09:18
  • Hi @njuffa, I think I have solved it, but I have noticed that a *find first zero* is necessary to get the number of left shifts when a subtraction is performed... Is there any trick to avoid it? I have implemented this *CLZ* https://www.researchgate.net/publication/284919835_Modular_Design_Of_Fast_Leading_Zeros_Counting_Circuit, which uses only a few hardware resources, about 25 ALMs, so I think I will be able to modify it to use it as an *ffz*. I am surprised by the difference between performing an addition (when the signs are equal, I mean) and a subtraction. – Diego Ruiz Mar 29 '21 at 10:33
  • I made a mistake: I meant *find first one* instead of *find first zero*. – Diego Ruiz Mar 29 '21 at 13:46
  • Yes, count leading zeros is the operation needed to find out how many bits to shift left. One can also do it iteratively, by shifting left one bit at a time until the leading 1-bit is in the desired bit position. Note that cancellation of leading bits of the significand, subsequently requiring re-normalization, can only happen if the exponents of the two source operands differ by at most one. – njuffa Mar 29 '21 at 16:53
  • Hi @njuffa, in the end I have used the *CLZ* from the paper mentioned above and it works well, but I have modified it for a 24-bit word length instead of 32. Doing the *CLZ* iteratively in Verilog could be expensive: either the loop has to be unrolled, or a *state machine* is needed, but I think that with a state machine the maximum frequency will be low. I have a question about C/C++: is it possible to manage single bits or words of arbitrary bit length, such as the Verilog/VHDL `wire` data type? Whenever I work with bits in C/C++ I use bit masking to extract them... – Diego Ruiz Mar 30 '21 at 14:46
  • The circuit you need to renormalize after effective subtraction is a priority encoder (look for a library building block for that). Typically needs log2(#bits) layers. This is for a naive design. Read specific papers on *modern* FP adder design (e.g. "near path"/"far path" style). C/C++ have various fixed-width types, but in hardware emulation and synthesizable C/C++ it is common to define custom types for specific bit widths like `U3`, which are mapped to the next larger C/C++ type, e.g. `U3` -> `uint8_t`. Note: my HW-design experience reflects the state of 20 years ago. – njuffa Mar 30 '21 at 20:06
  • Yes, the circuit shown in the paper is composed of priority encoders; you only need to invert some signals in order to find the first one, zero, etc., and there is no penalty in hardware resources for it. The implementation of the adder/subtractor is at https://github.com/dieruiqu/OpenCL-Basic-Functions if you want to take a look. I think the best way for me to manage bits for testing purposes in C/C++ will be to create my own library with custom data types as you explain... Thanks for everything @njuffa!! – Diego Ruiz Mar 31 '21 at 10:34
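
For reference, a minimal Python sketch of the re-normalization step described in the comments above (the function name is made up for the sketch; exact cancellation to zero is handled, but underflow into subnormals is not):

def renormalize_after_subtraction(exp, mnts):
    # mnts is the 24-bit significand left by an effective subtraction
    if mnts == 0:
        return 0, 0                       # exact cancellation -> result is zero
    lz = 24 - mnts.bit_length()           # leading zeros; this is the CLZ in hardware
    return exp - lz, mnts << lz           # shift left, decrease the exponent by the same amount

For the 100.3654 and -100.254 example above, the significand left by the subtraction has 10 leading zeros, so the exponent drops from 133 to 123, as expected. In the Python code above, this step would go right after the mantissa subtraction, before the result is packed back into bits.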
