How to add two floating point numbers with opposite sign?

Question

For fun and to figure out more about how floats work, I'm trying to make a function that takes two single precision floats, and adds them together.

What I've made so far works perfectly for same-sign numbers, but it falls apart when the numbers have opposite signs. I've looked over a number of questions and sites (UAF, How do you add 8-bit floating point with different signs, ICL, Adding 32 bit floating point numbers., How to add and subtract 16 bit floating point half precision numbers?, How to subtract IEEE 754 numbers?), but the ones that bring up subtraction mostly describe it as "basically the same but subtract instead", which I have not found very helpful. UAF does say:

Negative mantissas are handled by first converting to 2's complement and then performing the addition. After the addition is performed, the result is converted back to sign-magnitude form.

But it doesn't seem that I know how to do that. I found this and this, which explain what sign-magnitude is and how to convert between it and two's complement, so I tried converting like this:

manz = manx + ( ( (many | 0x01000000) ^ 0x007FFFFF) + 1);

and like this:

manz = manx + ( ( (many | 0x01000000) ^ 0x007FFFFF) + 1);
manz = ( ((manz - 1) ^ 0x007FFFFF) & 0xFEFFFFFF);

But neither of those worked.
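
For reference, the sign-magnitude to two's complement round trip quoted from UAF could be sketched roughly as below. This is only an illustration under some assumptions: the mantissas are already aligned to a common exponent, they include the implied bit (24 bits), and the helper name `add_sign_magnitude` is made up for this sketch. The point it tries to show is that the negation applies to the whole integer, not only to the low 23 bits.

```c
#include <stdint.h>

/* Hypothetical helper: adds two aligned 24-bit magnitudes (implied bit
 * included) whose signs are passed separately, going through two's
 * complement as the UAF quote describes. */
uint32_t add_sign_magnitude(uint32_t sign_x, uint32_t man_x,
                            uint32_t sign_y, uint32_t man_y,
                            uint32_t *sign_out)
{
    /* Sign-magnitude -> two's complement: negate the entire value. */
    int32_t a = sign_x ? -(int32_t)man_x : (int32_t)man_x;
    int32_t b = sign_y ? -(int32_t)man_y : (int32_t)man_y;
    int32_t sum = a + b;

    /* Two's complement -> sign-magnitude. */
    *sign_out = (sum < 0) ? 1u : 0u;
    return (uint32_t)(sum < 0 ? -sum : sum);
}
```

With 1.5 (mantissa 0xC00000) and -1.25 (mantissa 0xA00000), which share an exponent, this returns magnitude 0x200000 with a clear sign. The result still has to be normalized afterwards, because the leading 1 is no longer in bit 23.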

Trying the method of subtraction described by the other sources, I tried negating the mantissa of the negative numbers in various ways like these:

manz = manx - many;
manz = manx + (many - (1<<23));
manz = manx + (many - (1<<24));
manz = manx + ( (many - (1<<23)) & 0x007FFFFF );
manz = manx + ( (many - (1<<23)) + 1);
manz = manx + ( (~many & 0x007FFFFF) + 1);
manz = manx + (~many + 1);
manz = manx + ( (many ^ 0x007FFFFF) + 1);
manz = manx + ( (many ^ 0x00FFFFFF) + 1);
manz = manx + ( (many ^ 0x003FFFFF) + 1);

This is the statement that is supposed to handle the addition based on the sign. It runs after the mantissas have been aligned:

expz = expy;
if(signx != signy) { // opp sign
  if(manx < many) {
    signz = signy;
    manz = many + ((manx ^ 0x007FFFFF) + 1);
  } else if(manx > many) {
    signz = signx;
    manz = manx - ((many ^ 0x007FFFFF) + 1);
  } else { // x == y
    signz = 0x00000000;
    expz  = 0x00000000;
    manz  = 0x00000000;
  }
} else {
  signz = signx;
  manz  = manx + many;
}

This is the code immediately following it, which normalizes the number in the case of an overflow. It works when the numbers have the same sign, but I'm not sure the way it works makes sense when subtracting:

if(manz & 0x01000000) {
  expz++;
  manz = (manz >> 1) + (manz & 0x1);
}
manz &= 0x007FFFFF;

With the test values -3.34632F and 34.8532413F, I get the answer 0x427E0716 (63.506920) when it should be 0x41FC0E2D (31.506922), and with the test values 3.34632F and -34.8532413F, I get the answer 0xC27E0716 (-63.506920) when it should be 0xC1FC0E2D (-31.506922).

I was able to fix my problem by changing the way that I was normalizing the floats when subtracting.

expz = expy;
if(signx != signy) { // opp sign
  if(manx < many) {
    signz = signy;
    manz  = many - manx;
  } else if(manx > many) {
    signz = signx;
    manz  = manx - many;
  } else { // x == y
    signz = 0x00000000;
    expz  = 0x00000000;
    manz  = 0x00000000;
  }
  // Normalize subtraction
  while((manz & 0x00800000) == 0 && manz) {
      manz <<= 1;
      expz--;
  }
} else {
  signz = signx;
  manz  = manx + many;
  // Normalize addition
  if(manz & 0x01000000) {
    expz++;
    manz = (manz >> 1) + ( (manz & 0x2) ? (manz & 0x1) : 0 ); // round even, based on the bits of the sum
  }
}
manz &= 0x007FFFFF;
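
The re-normalization step after a subtraction can be isolated into a small function, sketched below. The name `normalize_after_subtract` is hypothetical, and the sketch assumes the mantissa carries the implied bit, the exponent is the biased 8-bit value, and the result stays a normal number (no denormal or underflow handling).

```c
#include <stdint.h>

/* Hypothetical helper: left-normalizes a 24-bit mantissa after a
 * subtraction and packs it with a biased exponent into float bits.
 * The sign bit is left clear; denormals and underflow are not handled. */
uint32_t normalize_after_subtract(uint32_t man, uint32_t exp)
{
    if (man == 0) {
        return 0; /* exact cancellation gives +0.0 */
    }
    /* Shift left until the leading 1 sits in bit 23 (the implied bit). */
    while ((man & 0x00800000u) == 0) {
        man <<= 1;
        exp--;
    }
    return (exp << 23) | (man & 0x007FFFFFu);
}
```

For example, 1.5 - 1.25 has mantissas 0xC00000 and 0xA00000 with biased exponent 127. The difference 0x200000 needs two left shifts, so the exponent drops to 125 and the packed result is 0x3E800000, which is 0.25.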

I'm sorry, but I really don't see the problem. If you have two floats `a` and `b`. Why not just use `a+b` to add them? — klutt, Oct 14 '19 at 22:59
because that doesn't show _how_ the addition works, I want to figure out precisely how the addition is performed, rather than just to add two floats. — sheep44, Oct 14 '19 at 23:02
@klutt because the task is to do the sum by dismantling the two `float`s and creating a new one, by understanding the storage format. — Weather Vane, Oct 14 '19 at 23:02
Maybe this can help? https://stackoverflow.com/q/12146443/6699433 — klutt, Oct 14 '19 at 23:05
That's unfortunately no more than I already understand; as noted in the question most of the sources I found say that subtraction is like addition but subtracting - which is basically what that says there - my problem is actually performing the subtraction, how does the computer accomplish it? — sheep44, Oct 14 '19 at 23:12
I found some videos on youtube when searching for "two complements float subtraction" — klutt, Oct 14 '19 at 23:57

Brendan · Accepted Answer · 2019-10-15T01:25:43.993

How to add two floating point numbers with opposite sign?

Mostly you don't.

For everything that works with numerical types that can't rely on "twos complement wrap on overflow" (e.g. floating point, big number libraries, ...) you always end up with something like:

add_signed(v1, v2) {
    if( v1 < 0) {
        if( v2 < 0) {
            // Both negative
            return -add_unsigned(-v1, -v2);
        } else {
            // Different sign, v1 is negative
            return subtract_unsigned(v2, -v1);
        }
    } else {
        if( v2 < 0) {
            // Different sign, v2 is negative
            return subtract_unsigned(v1, -v2);
        } else {
            // Both positive
            return add_unsigned(v1, v2);
        }
    }
}

subtract_signed(v1, v2) {
    return add_signed(v1, -v2);
}

add_unsigned(v1, v2) {
    // Here we know that v1 and v2 will never be negative, and
    //   we know that the result will never be negative
    ...
}

subtract_unsigned(v1, v2) {
    if(v1 < v2) {
        return -subtract_unsigned(v2, v1);
    }
    // Here we know that v1 and v2 will never be negative, and
    //   we know that the result will never be negative
    ...
}

In other words; all of the actual addition and all of the actual subtraction happens with unsigned ("never negative") numbers.
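
As a small runnable illustration of that dispatch, here is a sketch that uses plain integers as magnitudes instead of floats. The `sm_int` type and `sm_add` function are invented for this example only, and overflow of the magnitude is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sign-magnitude integer, used only for illustration. */
typedef struct {
    bool negative;
    uint32_t magnitude;
} sm_int;

/* Adds two sign-magnitude values; magnitude overflow is not handled. */
sm_int sm_add(sm_int a, sm_int b)
{
    sm_int r;
    if (a.negative == b.negative) {
        /* Same sign: add the magnitudes and keep the shared sign. */
        r.negative = a.negative;
        r.magnitude = a.magnitude + b.magnitude;
    } else if (a.magnitude >= b.magnitude) {
        /* Opposite signs: subtract smaller from larger, keep larger's sign. */
        r.negative = a.negative;
        r.magnitude = a.magnitude - b.magnitude;
    } else {
        r.negative = b.negative;
        r.magnitude = b.magnitude - a.magnitude;
    }
    if (r.magnitude == 0) {
        r.negative = false; /* avoid producing a negative zero */
    }
    return r;
}
```

Adding -3 and +35 this way takes the opposite-sign branch, computes 35 - 3 on unsigned values, and returns +32. No signed arithmetic is involved at any point.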

More complete example for addition of 32-bit floating point emulation only (in C, untested and probably buggy; might or might not work for denormals; no support for NaNs or infinities; no support for overflow or underflow; no "shift mantissa left to reduce precision loss before rounding"; and no support for rounding modes other than "round towards zero"):

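#include <stdint.h>

#define SIGN_FLAG      0x80000000U
#define EXPONENT_MASK  0x7F800000U
#define MANTISSA_MASK  0x007FFFFFU
#define IMPLIED_BIT    0x00800000U
#define OVERFLOW_BIT   0x01000000U
#define EXPONENT_ONE   0x00800000U

// Forward declarations, since add_signed() calls these before they are defined
uint32_t add_unsigned(uint32_t v1, uint32_t v2);
uint32_t subtract_unsigned(uint32_t v1, uint32_t v2);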

uint32_t add_signed(uint32_t v1, uint32_t v2) {
    if( (v1 & SIGN_FLAG) != 0) {
        if( (v2 & SIGN_FLAG) != 0) {
            // Both negative
            return SIGN_FLAG | add_unsigned(v1 & ~SIGN_FLAG, v2 & ~SIGN_FLAG);
        } else {
            // Different sign, v1 is negative
            return subtract_unsigned(v2, v1 & ~SIGN_FLAG);
        }
    } else {
        if( (v2 & SIGN_FLAG) != 0) {
            // Different sign, v2 is negative
            return subtract_unsigned(v1, v2 & ~SIGN_FLAG);
        } else {
            // Both positive
            return add_unsigned(v1, v2);
        }
    }
}

uint32_t subtract_signed(uint32_t v1, uint32_t v2) {
    return add_signed(v1, v2 ^ SIGN_FLAG);
}

uint32_t add_unsigned(uint32_t v1, uint32_t v2) {
    // Here we know that v1 and v2 will never be negative, and
    //   we know that the result will never be negative

    if(v1 < v2) {    // WARNING: Compares both exponents and mantissas
        return add_unsigned(v2, v1);
    }

    // Here we know the exponent of v1 is not smaller than the exponent of v2

    uint32_t m1 = (v1 & MANTISSA_MASK) | IMPLIED_BIT;
    uint32_t m2 = (v2 & MANTISSA_MASK) | IMPLIED_BIT;
    uint32_t exp2 = v2 & EXPONENT_MASK;
    uint32_t expr = v1 & EXPONENT_MASK;

    while(exp2 < expr) {
        m2 >>= 1;
        exp2 += EXPONENT_ONE;
    }
    uint32_t mr = m1+m2;
    if( (mr & OVERFLOW_BIT) != 0) {
        mr >>= 1;    // drop the lowest bit so the mantissa fits in 24 bits again
        expr += EXPONENT_ONE;
    }
    return expr | (mr & ~IMPLIED_BIT);
}

uint32_t subtract_unsigned(uint32_t v1, uint32_t v2) {
    if(v1 == v2) {
        return 0;
    }
    if(v1 < v2) {
        return SIGN_FLAG ^ subtract_unsigned(v2, v1);
    }

    // Here we know the exponent of v1 is not smaller than the exponent of v2,
    //  and that (if exponents are equal) the mantissa of v1 is larger
    //  than the mantissa of v2; and therefore the result will be
    //  positive

    uint32_t m1 = (v1 & MANTISSA_MASK) | IMPLIED_BIT;
    uint32_t m2 = (v2 & MANTISSA_MASK) | IMPLIED_BIT;
    uint32_t exp2 = v2 & EXPONENT_MASK;
    uint32_t expr = v1 & EXPONENT_MASK;

    while(exp2 < expr) {
        m2 >>= 1;
        exp2 += EXPONENT_ONE;
    }
    uint32_t mr = m1-m2;
    while( (mr & IMPLIED_BIT) == 0) {
        mr <<= 1;
        expr -= EXPONENT_ONE;
    }
    return expr | (mr & ~IMPLIED_BIT);
}
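
To try the functions above with real `float` values, the bit patterns have to be moved in and out of `uint32_t`. A minimal sketch, assuming `float` is a 32-bit IEEE 754 binary32 on the target platform (the helper names `float_to_bits` and `bits_to_float` are made up here):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers; they assume float is IEEE 754 binary32. */

/* Copies the raw bits of a float into a 32-bit unsigned integer. */
uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterprets a 32-bit pattern as a float. */
float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

A call would then look like `bits_to_float(add_signed(float_to_bits(a), float_to_bits(b)))`. Because the example truncates instead of rounding to nearest, the last mantissa bit may differ from what `a + b` gives in hardware.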

I'm not really sure what you mean by this, I don't think you mean to use "unsigned floats" or something like that, but I'm unsure how this would be applied to my particular problem. — sheep44, Oct 14 '19 at 23:17
I mean that; for `add_unsigned(v1, v2)` the values for the result, `v1` and `v2` are all never negative; and for `sub_unsigned(v1,v2)` the values for `v1` and `v2` are never negative (but the result may be negative if `v1` is smaller than `v2`, which is a special case you'd test for in the function). — Brendan, Oct 14 '19 at 23:24
Updated the example code to show the logic (and show that, where it really matters, `v1`, `v2` and the result are never negative). — Brendan, Oct 14 '19 at 23:27
What I'm doing in the code in question is if the floats have opposite signs, then make the resulting sign the same as the float with the larger magnitude, and then subtract the mantissa of the float with smaller magnitude from the mantissa of the one with larger magnitude, but I have been unable to figure out precisely how to actually subtract them. — sheep44, Oct 14 '19 at 23:29
For floating point; after discarding/ignoring sign flags, making sure you're subtracting a smaller value from a larger value and making sure exponents match, it's just a direct subtraction of the mantissas (very similar to the "direct addition of mantissas" you're already doing for other cases). — Brendan, Oct 14 '19 at 23:32
This is what I've seen, but it doesn't seem to work, I can only assume that somewhere in my code I'm doing it wrong; I've noticed that I was checking the sign instead of which of the mantissas was greater (part of the problem), however simply subtracting still doesn't appear to work, and neither does `manx + ((many ^ 0x007FFFFF) + 1)` — sheep44, Oct 14 '19 at 23:40
Just a guess; but... For the comparison needed to determine if you're subtracting a small value from a larger value (the `if(v1 < v2)` in my example code); are you using the exponents (or just the mantissas)? For floating point it would need to be more like `if( ( v1.exponent < v2.exponent) || ( (v1.exponent == v2.exponent) && (v1.mantissa < v2.mantissa) ) ) {`. — Brendan, Oct 15 '19 at 00:18
the smaller exponent is incremented to equal the larger and the mantissa is right shifted proportionately — sheep44, Oct 15 '19 at 00:28
Too hard to guess the problem from snippets; so I added a more complete example for 32-bit floating point emulation. — Brendan, Oct 15 '19 at 01:16
My problem was that I wasn't re-normalizing properly after the subtraction — sheep44, Oct 15 '19 at 20:33
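
The magnitude comparison discussed in the comments (exponent first, then mantissa) can be sketched in two equivalent ways for finite values. Both function names are hypothetical, and NaNs are not considered.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field-by-field comparison: exponent first, then mantissa. */
bool magnitude_less_fields(uint32_t v1, uint32_t v2)
{
    uint32_t e1 = (v1 >> 23) & 0xFFu, e2 = (v2 >> 23) & 0xFFu;
    uint32_t m1 = v1 & 0x007FFFFFu,   m2 = v2 & 0x007FFFFFu;
    return (e1 < e2) || (e1 == e2 && m1 < m2);
}

/* Same ordering on the raw bits once the sign bit is cleared, because
 * the exponent field sits directly above the mantissa field. */
bool magnitude_less_bits(uint32_t v1, uint32_t v2)
{
    return (v1 & 0x7FFFFFFFu) < (v2 & 0x7FFFFFFFu);
}
```

This is why the plain `v1 < v2` test in the answer's `subtract_unsigned` is enough there: by that point both arguments already have their sign bit cleared.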
