
I've been trying to create a library for the Motorola 68000 (the original, without a floating point coprocessor) that can convert a binary32 floating point value to text that can be displayed on-screen. I was using this website which converts floats to hex to get a breakdown of the logic. Here's my output on that website with the test value 0x3E400000:

0x3e400000
  3      e       4       0      0       0       0        0
0011   1110    0100    0000    0000    0000    0000    0000
0   01111100    10000000000000000000000
sign    exponent    mantissa
+1  124 1.10000000000000000000000 (binary)
+1 *    2^(124 - 127) * 1.5
+1 *    0.125000000 *   1.5

The thing is, this all seems like circular logic. Without a floating point coprocessor, how would you calculate and store 2^-3 or 1.5? Wouldn't each of those need to be run through the same routine as your input, and so on?

This is what I have in my assembly code if it helps. If there are instructions you don't recognize, I've created macros to help me out and I've explained them in the comments. Maybe this will make what I'm trying to do clearer.

Float2Text:
    ;input: d0
    ;clobbers d1
    pushRegs d2-d7  ;MOVEM.L D2-D7,-(SP)
    MOVEQ.L #0,D7
    MOVEQ.L #0,D6
    MOVEQ.L #0,D5
    MOVEQ.L #0,D4
    
    
    CLX             ;ANDI #%00001111,CCR
    MOVE.L D0,D2
    ROXL.L #1,D0
    ROXL.L #1,D7    ;TRANSFER SIGN BIT INTO D7
    
    SWAP D0
    ROR.W #8,D0     ;GET EXPONENT INTO BOTTOM BYTE
    CMP.B #0,D0
    BEQ .isZero
    CMP.B #%11111111,D0
    BEQ .isInfinity
    
    SUB.B #127,D0
    ;THIS IS THE UNBIASED EXPONENT OF 2
    
    ;NOW WE HAVE THE EXPONENT IN D0, AND THE SIGN BIT IN D7
    ;WE NEED TO SUM UP THE MANTISSA BITS.
    MOVE.B D0,D3
    MOVE.L D2,D0
    AND.L #%00000000011111111111111111111111,D0     ;CLEAR SIGN AND EXPONENT
    
    
    
.isZero:

.isInfinity:

.done:
    popRegs d2-d7   ;MOVEM.L (SP)+,D2-D7
    RTS
puppydrum64
  • Floating point operations aren't magic; you can use multiple integer instructions to implement them (software floating point). Or in your case, since you know you don't have hardware FP support in the first place, you could use string<->float code that only uses extended-precision integer stuff on the exponent and mantissa in the first place. See also [Why does "dtoa.c" contain so much code?](https://stackoverflow.com/q/3173056) and https://www.exploringbinary.com/quick-and-dirty-floating-point-to-decimal-conversion/. – Peter Cordes Jun 07 '22 at 20:55
  • For the opposite direction, glibc's strtod does use only extended-precision integer math to produce a binary64 or binary32 (https://www.exploringbinary.com/how-glibc-strtod-works/), but I'm not sure if there actually is an existing implementation of dtoa or ftoa (binary32 -> decimal ASCII string) that doesn't use any FP operations. In binary it's simple, just put the decimal point (err radix point) at the right place relative to the mantissa bits. But multiplying or dividing by 2 in base 10 changes the base10 digits. – Peter Cordes Jun 07 '22 at 20:56
  • Translating floating point values to text with great accuracy and performance is non-trivial, as there is a fundamental mismatch between the base 2 representation and decimal representation, and simplistic solutions lose precision quickly. It is a problem that has been studied repeatedly over the years. For example, have a look at this [article](https://www.cs.tufts.edu/~nr/cs257/archive/florian-loitsch/printf.pdf). (Also, it doesn't necessarily have to be done in assembly -- you might work in C until you have a satisfactory algorithm.) – Erik Eidt Jun 07 '22 at 22:45
  • The math itself is pretty easy: just like in grade school, for add and subtract you line up the decimal (binary) points; for multiply and divide you just do the math and then add the exponents. Converting to base 10 text, yeah, that's a PITA... but it is for fixed point as well. – old_timer Jun 07 '22 at 23:06
  • Exceptions are where it gets painful: quiet and signaling NaNs, etc... – old_timer Jun 07 '22 at 23:41
  • I use Wikipedia for the floating point formats; from there it is just elementary school math. And then you normalize to fit the format: 1.xxxxx, not 10.xxxxx, not 0.1xxxx, etc. – old_timer Jun 08 '22 at 00:16
  • @old_timer: NaNs aren't *painful*, just a special encoding you can check for separately. NaNs and +/-Inf are the easiest thing, just check for an all-ones exponent field and whether the exponent field is 0 (Inf) or not (NaN) and output a fixed string. The painful part is dealing with the fact that the value represented can be a huge integer, way larger than 2^32 or even 2^64. (Nearly 2^128 for [binary32](//en.wikipedia.org/wiki/Single-precision_floating-point_format)), and figuring out how many fractional digits to print to round-trip the value back through strtod, out of maybe infinite. – Peter Cordes Jun 08 '22 at 07:06
  • @PeterCordes, well, they add more rules. It is more than just adding two numbers using fixed point; now you get into "if this, and if exceptions are enabled, else if...", and then how the NaNs propagate. It is just a lot more work, if the OP is really interested in a soft FPU for IEEE. If the goal here is just "how do I add a couple of (normal) floating point numbers", that is quite easy. – old_timer Jun 08 '22 at 13:56
  • And the version of the spec that an FPU team I was on worked on, compared to what the spec is today, was painful (Intel and others could not meet the spec in hardware at the time). I think they made some things better, but I do not have a copy of the current spec. – old_timer Jun 08 '22 at 13:58

1 Answer


For Float2Text:

A floating point number is more accurately described as "base 2 floating point". E.g. the number 1.25 in binary is represented as 1.01b or 101b >> 2.

Your initial goal should be to convert "base 2 floating point" into "base 10 floating point". E.g. for "base 10 floating point" the number 1.25 will be the digits (characters?) 125 with the exponent of -2, like "125 * 10^(-2)". Once you have the number in "base 10 floating point" it becomes relatively easy to print (e.g. just print the digits while inserting the decimal point in the right place).

Start by splitting the "base 2 floating point" number into a signed integer "base 2 mantissa" (while taking care of the "implied leading 1" if it's not a sub-normal) and a signed integer "base 2 exponent" (while taking care of the bias).
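
A minimal C sketch of this splitting step (the function name and the "value = mantissa * 2^exponent" convention are mine, not from the original post; zero, infinity and NaN still need separate handling):

    #include <stdint.h>

    /* Split a binary32 bit pattern into sign, an integer mantissa and a
       "base 2 exponent" such that value = mantissa * 2^exponent. */
    void unpack_binary32(uint32_t bits, int *sign, uint32_t *mantissa, int32_t *exponent) {
        int32_t biased = (bits >> 23) & 0xFF;      /* 8 exponent bits  */
        uint32_t frac  = bits & 0x007FFFFF;        /* 23 mantissa bits */

        *sign = (bits >> 31) & 1;
        if (biased == 0) {
            /* sub-normal (or zero): no implied leading 1 */
            *mantissa = frac;
            *exponent = -126 - 23;
        } else {
            /* normal: restore the implied leading 1, remove the bias,
               and fold in the 23-bit shift so the mantissa is a plain integer */
            *mantissa = frac | 0x00800000;
            *exponent = biased - 127 - 23;
        }
    }

For the question's test value 0x3E400000 this gives mantissa = 0xC00000 and exponent = -26, i.e. 12582912 * 2^-26 = 0.1875 (the same 1.5 * 2^-3 shown by the website).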

Next, if the "base 2 exponent" is negative, multiply the original number by 10 while decreasing the "base 10 exponent" to keep track. Note that you can multiply by 10 by multiplying "base 2 mantissa" by 5 and increasing "base 2 exponent". In the same way you can multiply by 100 by multiplying "base 2 mantissa" by 25 and adding 2 to "base 2 exponent"; or multiply by 1000 by multiplying "base 2 mantissa" by 125 and adding 3 to "base 2 exponent", etc. You can speed this up a lot by using the existing "base 2 exponent" to select the right multiplier; e.g. maybe something vaguely like:

    while(base2_exponent < 0) {
        switch(base2_exponent) {
            case -1:
                base2_mantissa *= 5; base2_exponent += 1;   /* *10 = *5 * 2^1 */
                base10_exponent -= 1;
                break;
            case -2:
                base2_mantissa *= 25; base2_exponent += 2;  /* *100 = *25 * 2^2 */
                base10_exponent -= 2;
                break;
            case -3:
                base2_mantissa *= 125; base2_exponent += 3; /* *1000 = *125 * 2^3 */
                base10_exponent -= 3;
                break;
            default:
                base2_mantissa *= 625; base2_exponent += 4; /* *10000 = *625 * 2^4 */
                base10_exponent -= 4;
                break;
        }
    }

This will cause a minor problem - in the worst case (very small numbers) the "base 2 mantissa" can grow to become about 160 bits. To deal with this you have to use "big integers" (you can't just do base2_mantissa *= 625; with a normal integer and will need something more like big_integer_multiply(&base2_mantissa, 625); instead).
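
For example (not part of the original answer), the "big integer" could be an array of 32-bit limbs with a multiply-by-small-constant helper roughly like this; on a real 68000, MULU is only 16x16->32, so the limbs would more naturally be 16 bits wide:

    #include <stdint.h>

    #define BIG_LIMBS 6                  /* 6 * 32 = 192 bits of headroom */

    typedef struct {
        uint32_t limb[BIG_LIMBS];        /* least significant limb first */
    } big_t;

    /* big = big * small, where small is 5, 25, 125, 625, ... */
    void big_integer_multiply(big_t *big, uint32_t small) {
        uint64_t carry = 0;
        for (int i = 0; i < BIG_LIMBS; i++) {
            uint64_t t = (uint64_t)big->limb[i] * small + carry;
            big->limb[i] = (uint32_t)t;  /* low 32 bits stay in this limb */
            carry = t >> 32;             /* high 32 bits carry upward */
        }
        /* BIG_LIMBS is sized so the final carry stays zero */
    }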

Next, if the "base 2 exponent" is positive, multiply "base 2 mantissa" by 2 and decrease "base 2 exponent" (or in other words, shift "base 2 mantissa" left by the "base 2 exponent"). This has a similar problem - in the worst case (very large numbers) the "base 2 mantissa" can grow to become about 160 bits.
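
With the same hypothetical limb layout as the sketch above, the left shift could be done one bit at a time (a real routine would shift by whole limbs first, then by the remaining bits):

    /* big = big << 1; repeat base2_exponent times (or generalise to shift by n) */
    void big_integer_shift_left1(big_t *big) {
        uint32_t carry = 0;
        for (int i = 0; i < BIG_LIMBS; i++) {
            uint32_t top = big->limb[i] >> 31;   /* bit that moves up into the next limb */
            big->limb[i] = (big->limb[i] << 1) | carry;
            carry = top;
        }
    }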

At this point, "base 2 exponent" will be zero and can be ignored.

The next step is converting "base 2 mantissa" into a "base 10 mantissa". In theory this shouldn't be that hard - maybe something vaguely like while(base2_mantissa > 0) { *dest = base2_mantissa % 10 + '0'; base2_mantissa /= 10; dest--; } except that you'll be using "big integers" (160-bit integers) for the modulo and division. However, it's better to work from most significant digit to least significant digit. To do this, have a lookup table of powers of 10 (e.g. 1, 10, 100, 1000, 10000, ...), find the largest entry that's not larger than the mantissa, then step that divisor down by a factor of 10 each iteration (so interior zero digits aren't skipped) - something vaguely like divisor = find_largest_divisor_from_table(base2_mantissa); while(divisor > 0) { *dest = base2_mantissa / divisor + '0'; base2_mantissa %= divisor; divisor /= 10; dest++; }. This allows you to quit early - e.g. if you only need a maximum of 10 significant digits then you can use a fixed-size buffer of 10 chars and stop when you have 10 digits. Of course all of this (the table of divisors, etc.) will need to use "big integers" too.
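
Here is a rough C illustration of that most-significant-digit-first loop, using uint64_t for brevity; the real routine would do the same divisions on the ~160-bit big integer:

    #include <stdint.h>

    /* Write up to max_digits decimal digits of value into dest, most
       significant digit first. Returns the number of digits written. */
    int digits_msd_first(uint64_t value, char *dest, int max_digits) {
        /* find the largest power of 10 that is <= value */
        uint64_t divisor = 1;
        while (value / divisor >= 10) {
            divisor *= 10;
        }

        int count = 0;
        while (divisor > 0 && count < max_digits) {
            dest[count++] = (char)('0' + value / divisor);  /* next digit */
            value %= divisor;                               /* keep the remainder */
            divisor /= 10;                                  /* step down one decimal place */
        }
        return count;
    }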

The final step is printing "base 10 mantissa" while making sure you insert the decimal point character ('.') where "base 10 exponent" says it should be. If you worked from most significant digit to least significant digit then this can be combined with the previous step (e.g. print one digit at a time without storing "base 10 mantissa" at all, inserting the decimal point in the right place as you go).
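
For instance (again just a sketch, ignoring the very large and very small exponents that would normally switch to scientific notation), placing the decimal point from a digit buffer and the "base 10 exponent" could look like:

    #include <stdio.h>

    /* Print a value equal to digits * 10^exp10, where digits[0..ndigits-1]
       holds the decimal digits most significant first. */
    void print_decimal(const char *digits, int ndigits, int exp10) {
        int point = ndigits + exp10;             /* digits before the decimal point */

        if (exp10 >= 0) {
            /* pure integer: the digits followed by exp10 trailing zeros */
            printf("%.*s", ndigits, digits);
            for (int i = 0; i < exp10; i++) putchar('0');
        } else if (point > 0) {
            /* point falls inside the digit string: "125", -2  ->  "1.25" */
            printf("%.*s.%.*s", point, digits, ndigits - point, digits + point);
        } else {
            /* value below 1: "0.", then -point zeros, then the digits */
            printf("0.");
            for (int i = 0; i < -point; i++) putchar('0');
            printf("%.*s", ndigits, digits);
        }
    }

On the 68000 you would of course build the text in a buffer instead of calling printf, but the point-placement logic is the same.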

For other floating point operations:

Negation is trivial (an XOR of the sign bit).
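
In C terms, on the raw binary32 bit pattern:

    #include <stdint.h>

    /* flip only the sign bit; everything else is untouched */
    uint32_t float32_negate(uint32_t bits) {
        return bits ^ 0x80000000u;
    }

On the 68000 that is a single EORI.L #$80000000,Dn.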

For everything else, do it in 3 phases - split the source values into "signed integer mantissa" and "signed integer exponent" (mostly as described above); do a "middle phase" that depends on what the operation is; then combine the resulting "signed integer mantissa" and "signed integer exponent" back into the right format.

As a general guide (without complications - overflow, underflow, rounding, NaNs, ...) the "middle phases" are:

  • for multiplication: result_mantissa = src1_mantissa * src2_mantissa; result_exponent = src1_exponent + src2_exponent - mantissa_size;

  • for division: result_mantissa = (src1_mantissa << mantissa_size) / src2_mantissa; result_exponent = src1_exponent - src2_exponent;

  • for addition: find the value with the most positive exponent; shift the other value's mantissa right while increasing its exponent until the exponents match; result_mantissa = src1_mantissa + src2_mantissa; result_exponent = src1_exponent;

  • for subtraction: negate the second value and use addition instead

  • for shift left: if the mantissa is too small (sub-normal) shift it left as much as you can (while making sure it doesn't become too large); then add the remaining shift count to the exponent.

  • for shift right: subtract the shift count from the exponent as much as you can while keeping the exponent from becoming "too negative". If there's any shift count left then shift the mantissa right (forming a sub-normal).

Note that combining the resulting "signed integer mantissa" and "signed integer exponent" back into the right format should include some kind of while(result_mantissa's magnitude is too big) { result_mantissa >>= 1; result_exponent++; }, which ends up being a little messy (it can/should take the rounding mode into account). A rough C sketch of the multiplication case follows.
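
As a concrete (simplified) example of the multiplication "middle phase", using the integer "value = mantissa * 2^exponent" convention from the unpacking sketch earlier (so the "- mantissa_size" adjustment is already folded into the exponents), and with plain truncation instead of proper rounding:

    #include <stdint.h>

    /* result = a * b for two unpacked binary32 values (24-bit integer
       mantissas). Truncates instead of rounding to nearest-even. */
    void float32_mul(uint32_t m1, int32_t e1, uint32_t m2, int32_t e2,
                     uint32_t *m_out, int32_t *e_out) {
        uint64_t m = (uint64_t)m1 * m2;   /* up to 48 significant bits */
        int32_t  e = e1 + e2;

        /* renormalise: shift the magnitude back down to 24 bits,
           bumping the exponent once per bit shifted out */
        while (m > 0x00FFFFFFu) {
            m >>= 1;
            e += 1;
        }
        *m_out = (uint32_t)m;
        *e_out = e;
    }

(The 68000's MULU is only 16x16->32, so the 24x24-bit multiply itself would be built from partial products, but the shape of the "middle phase" is the same.)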

Brendan