For fun and to figure out more about how floats work, I'm trying to write a function that takes two single-precision floats and adds them together.
What I've made so far works perfectly for numbers with the same sign, but it falls apart when the numbers have opposite signs. I've looked over a number of questions and sites (UAF, How do you add 8-bit floating point with different signs, ICL, Adding 32 bit floating point numbers., How to add and subtract 16 bit floating point half precision numbers?, How to subtract IEEE 754 numbers?), but the ones that bring up subtraction mostly describe it as "basically the same, but subtract instead", which I have not found very helpful. UAF does say:
Negative mantissas are handled by first converting to 2's complement and then performing the addition. After the addition is performed, the result is converted back to sign-magnitude form.
But I don't seem to know how to do that. I found this and this, which explained what sign-magnitude is and how to convert between it and two's complement, so I tried converting like this:
manz = manx + ( ( (many | 0x01000000) ^ 0x007FFFFF) + 1);
and like this:
manz = manx + ( ( (many | 0x01000000) ^ 0x007FFFFF) + 1);
manz = ( ((manz - 1) ^ 0x007FFFFF) & 0xFEFFFFFF);
But neither of those worked.
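For reference, the conversion the UAF quote describes can be sketched roughly as below. This is only an illustration under assumptions: `add_sign_magnitude` is a hypothetical helper name, and both magnitudes are assumed to be already aligned 24-bit significands with the implicit leading 1 set. One thing worth noting is that XOR with 0x007FFFFF flips only the low 23 bits, so it does not negate the whole word; a two's complement negation inverts every bit of the integer and adds 1, which in C is simply unary minus on a signed type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add two sign-magnitude values by going through
   two's complement and back. sign_* is 0 for positive, nonzero for negative. */
static uint32_t add_sign_magnitude(uint32_t sign_a, uint32_t mag_a,
                                   uint32_t sign_b, uint32_t mag_b,
                                   uint32_t *sign_out)
{
    /* Assumed precondition: magnitudes fit in 24 bits, so the signed
       arithmetic below cannot overflow a 32-bit integer. */
    assert(mag_a < (1u << 24) && mag_b < (1u << 24));

    /* Sign-magnitude -> two's complement: negate the whole value. */
    int32_t a = sign_a ? -(int32_t)mag_a : (int32_t)mag_a;
    int32_t b = sign_b ? -(int32_t)mag_b : (int32_t)mag_b;

    int32_t sum = a + b;

    /* Two's complement -> sign-magnitude: split off the sign again. */
    *sign_out = (sum < 0) ? 1u : 0u;
    return (sum < 0) ? (uint32_t)(-sum) : (uint32_t)sum;
}
```

The returned magnitude is not normalized; after a subtraction its leading 1 can sit anywhere below bit 23, so it still has to be shifted back into place.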
Trying the method of subtraction described by the other sources, I tried negating the mantissa of the negative numbers in various ways like these:
manz = manx - many;
manz = manx + (many - (1<<23));
manz = manx + (many - (1<<24));
manz = manx + ( (many - (1<<23)) & 0x007FFFFF );
manz = manx + ( (many - (1<<23)) + 1);
manz = manx + ( (~many & 0x007FFFFF) + 1);
manz = manx + (~many + 1);
manz = manx + ( (many ^ 0x007FFFFF) + 1);
manz = manx + ( (many ^ 0x00FFFFFF) + 1);
manz = manx + ( (many ^ 0x003FFFFF) + 1);
This is the statement that is supposed to handle the addition based on the sign; it runs after the mantissas have been aligned:
expz = expy;
if(signx != signy) { // opp sign
    if(manx < many) {
        signz = signy;
        manz = many + ((manx ^ 0x007FFFFF) + 1);
    } else if(manx > many) {
        signz = signx;
        manz = manx - ((many ^ 0x007FFFFF) + 1);
    } else { // x == y
        signz = 0x00000000;
        expz = 0x00000000;
        manz = 0x00000000;
    }
} else {
    signz = signx;
    manz = manx + many;
}
This is the code immediately following it, which normalizes the number in the case of an overflow. It works when the numbers have the same sign, but I'm not sure the way it works makes sense when subtracting:
if(manz & 0x01000000) {
    expz++;
    manz = (manz >> 1) + (manz & 0x1);
}
manz &= 0x007FFFFF;
With the test values -3.34632F and 34.8532413F, I get the answer 0x427E0716 (63.506920) when it should be 0x41FC0E2D (31.506922), and with the test values 3.34632F and -34.8532413F, I get the answer 0xC27E0716 (-63.506920) when it should be 0xC1FC0E2D (-31.506922).
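One way to obtain reference bit patterns like the ones above is to let the hardware do the addition and then inspect the result's bits. The sketch below assumes a platform where `float` is a 32-bit IEEE 754 single; `float_bits` and `reference_sum_bits` are hypothetical helper names, not part of the code in the question.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a float's storage as a 32-bit integer.
   memcpy avoids the aliasing problems of pointer casts. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof u == sizeof f); /* assumes a 32-bit float */
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical helper: bit pattern of the hardware sum a + b. */
static uint32_t reference_sum_bits(float a, float b)
{
    /* volatile forces the sum to be stored as a single-precision value
       even if the compiler evaluates expressions in a wider format. */
    volatile float sum = a + b;
    return float_bits(sum);
}
```

On such a platform, `reference_sum_bits(-3.34632F, 34.8532413F)` gives 0x41FC0E2D, which is the expected value quoted above.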
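I was able to fix my problem by changing the way I normalize the result when subtracting. A subtraction cannot carry out of bit 24, but it can clear the leading bits, so the result has to be shifted left until the leading 1 is back in bit 23, decrementing the exponent on each shift: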
expz = expy;
if(signx != signy) { // opp sign
    if(manx < many) {
        signz = signy;
        manz = many - manx;
    } else if(manx > many) {
        signz = signx;
        manz = manx - many;
    } else { // x == y
        signz = 0x00000000;
        expz = 0x00000000;
        manz = 0x00000000;
    }
    // Normalize subtraction: shift left until the leading 1 is in bit 23
    while((manz & 0x00800000) == 0 && manz) {
        manz <<= 1;
        expz--;
    }
} else {
    signz = signx;
    manz = manx + many;
    // Normalize addition: a carry into bit 24 means shift right once
    if(manz & 0x01000000) {
        expz++;
        // round even: round up only if the dropped bit is 1 and the kept LSB is odd
        manz = (manz >> 1) + ( (manz & 0x2) ? (manz & 0x1) : 0 );
    }
}
manz &= 0x007FFFFF; // drop the implicit leading 1 before packing
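To show how the pieces fit together, here is a self-contained sketch of the whole addition. It is not the exact code above: the unpacking and alignment steps are not shown in the post, so they are filled in here as assumptions, and the function name `soft_add` is hypothetical. It also swaps in a different rounding technique, keeping three extra low bits (guard, round, sticky) during alignment and rounding to nearest even at the end. It only handles normal, finite inputs with a normal result; zeros, subnormals, infinities and NaNs are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only: add two single-precision values given as raw bit patterns. */
static uint32_t soft_add(uint32_t x, uint32_t y)
{
    /* Put the larger magnitude in x, so the subtraction below never goes
       negative and the result takes x's sign. */
    if ((x & 0x7FFFFFFFu) < (y & 0x7FFFFFFFu)) {
        uint32_t t = x; x = y; y = t;
    }
    uint32_t signx = x & 0x80000000u, signy = y & 0x80000000u;
    int32_t expx = (int32_t)((x >> 23) & 0xFFu);
    int32_t expy = (int32_t)((y >> 23) & 0xFFu);
    assert(expx > 0 && expx < 255 && expy > 0 && expy < 255); /* normals only */

    /* 24-bit significands (implicit 1 restored) plus 3 extra low bits. */
    uint32_t manx = ((x & 0x007FFFFFu) | 0x00800000u) << 3;
    uint32_t many = ((y & 0x007FFFFFu) | 0x00800000u) << 3;

    /* Align y to x's exponent; OR every shifted-out bit into the sticky bit. */
    uint32_t shift = (uint32_t)(expx - expy);
    if (shift > 27u) shift = 27u;
    uint32_t lost = many & ((1u << shift) - 1u);
    many = (many >> shift) | (lost != 0u ? 1u : 0u);

    int32_t expz = expx;
    uint32_t manz;
    if (signx == signy) {
        manz = manx + many;
        if (manz & (1u << 27)) {            /* carry out: shift right once */
            manz = (manz >> 1) | (manz & 1u);
            expz++;
        }
    } else {
        manz = manx - many;
        if (manz == 0u) return 0u;          /* exact cancellation gives +0 */
        while ((manz & (1u << 26)) == 0u) { /* shift the leading 1 back up */
            manz <<= 1;
            expz--;
        }
    }

    /* Round to nearest, ties to even, using the 3 extra bits. */
    uint32_t grs = manz & 7u;
    manz >>= 3;
    if (grs > 4u || (grs == 4u && (manz & 1u))) {
        manz++;
        if (manz & 0x01000000u) {           /* rounding carried out */
            manz >>= 1;
            expz++;
        }
    }
    assert(expz > 0 && expz < 255);         /* result assumed normal */
    return signx | ((uint32_t)expz << 23) | (manz & 0x007FFFFFu);
}
```

With the bit patterns of the test values from the question, `soft_add(0xC0562A1B, 0x420B69B8)` returns 0x41FC0E2D and `soft_add(0x40562A1B, 0xC20B69B8)` returns 0xC1FC0E2D. The extra bits matter here: aligning -3.34632F shifts out nonzero bits, and without them the result would be off by one unit in the last place.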