
I need to be able to use floating-point arithmetic in my C dev environment (CPU: ~12 MHz Motorola 68000). The standard library is not present, meaning it is bare-bones C, and no, the compiler isn't gcc, due to several other issues.

I tried getting the SoftFloat library to compile, along with one other 68k-specific FP library (whose name escapes me at the moment), but their dependencies cannot be resolved on this particular platform, mostly due to the missing libc. I spent about 8 hours trying to overcome the linking issues before reaching a point where I knew I couldn't get any further.

However, it took a mere half hour to come up with and implement the following set of functions, which emulate floating-point functionality sufficiently for my needs.

The basic idea is that both the fractional and the non-fractional part are 16-bit integers, so there is no bit manipulation. The non-fractional part has a range of [-32767, +32767] and the fractional part of [-0.9999, +0.9999], which gives 4 digits of precision (good enough for my floating-point needs, albeit wasteful).

It looks to me like this could be used to make a faster, smaller (just 2 bytes) alternative version of a float with the ranges [-99, +99] and [-0.9, +0.9].
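For what it's worth, that 2-byte variant could look something like this (a sketch; the type name, function name, and the same-sign convention for the two parts are my assumptions):

```c
#include <stdint.h>

/* hypothetical 2-byte soft-float: value = whole + tenths/10,
 * with whole in [-99, +99] and tenths in [-9, +9], same sign as whole */
typedef struct {
    int8_t whole;
    int8_t tenths;
} tiny_float;

tiny_float tiny_add(tiny_float a, tiny_float b)
{
    tiny_float r;
    int t = a.tenths + b.tenths;            /* -18 .. +18 */
    r.whole = (int8_t)(a.whole + b.whole);
    if (t > 9)       { r.whole++; t -= 10; }
    else if (t < -9) { r.whole--; t += 10; }
    r.tenths = (int8_t)t;
    return r;
}
```

For example, adding {12, 5} (12.5) and {3, 7} (3.7) carries one out of the tenths and yields {16, 2} (16.2).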

The question here is: what techniques, other than IEEE, are there for implementing basic floating-point functionality (+ - * /) on top of fixed-point?

Later on I will need some basic trigonometry, but there are lots of resources on the net for that.

  • Since the HW has 2 MB of RAM, I don't really care whether I can save 2 bytes per soft-float (say, by reserving 9 vs. 7 bits in an int). Thus, 4 bytes are good enough.
  • Also, from a brief look at the 68k instruction manual (and the cycle cost of each instruction), I made a few early observations:
    • Bit shifting is slow, and unless performance is of critical importance (not the case here), I'd prefer easy debugging of my soft-float library over 5-cycles-faster code. Besides, since this is C and not 68k ASM, speed is obviously not a critical factor.
    • 8-bit operands are as slow as 16-bit ones (give or take a cycle in most cases), so it does not seem to make much sense to compress floats for the sake of performance.

What improvements / approaches would you propose for implementing floating-point in C on top of fixed-point, without any dependency on another library or code?

Perhaps it would be possible to use a different approach and operate on the frac & non-frac parts at the same time?

Here is the code (tested only with a calculator); please ignore the C++-like reference parameters and mid-function declarations (I will reformat those to C style later):

inline int Pad (int f) // Pad the fractional part to 4 digits; assumes 0 <= f <= 9999
{
    if (f < 10)        return f * 1000;
    else if (f < 100)  return f * 100;
    else if (f < 1000) return f * 10;
    else               return f;
}

//  We assume fractional parts are padded to full 4 digits
inline void Add (int & b1, int & f1,  int b2, int f2)
{
    b1 += b2;
    f1 += f2;
    if (f1 > 9999)       { b1++; f1 -= 10000; }
    else if (f1 < -9999) { b1--; f1 += 10000; }
    // the result is already a full 4-digit fraction - calling Pad() here
    // would corrupt small results (e.g. 5, meaning 0.0005, becomes 5000)
}

inline void Sub (int & b1, int & f1,  int b2, int f2)
{
    // 123.1652 - 18.9752 = 104.1900
    b1 -= b2;                          // 105
    f1 -= f2;                          // -8100
    if (f1 < 0) { b1--; f1 += 10000; } // 104 and 1900
    // result is already a 4-digit fraction; no Pad() call
}

    // ToDo: Implement a multiplication by float
inline void Mul (int & b1, int & f1, int num)
{
    // 123.9876 * 251 = 31120.8876
    b1 *= num;                 // 30873
    long q = (long) f1 * num;  // 2478876 - widen first so a 16-bit int cannot overflow
    int add = q / 10000;       // 247
    b1 += add;                 // 31120
    f1 = q - (add * 10000L);   // 8876
    // result is already a 4-digit fraction; no Pad() call
}
    // ToDo: Implement a division by float
inline void Div (int & b1, int & f1, int num)
{
    // 123.9876 / 25 = 4.959504 -> 4.9595 with 4 digits
    int b2 = b1 / num;             // 4
    long q = b1 - (b2 * num);      // 23 (remainder)
    f1 = (q * 10000 + f1) / num;   // (230000 + 9876) / 25 = 9595
    b1 = b2;
    // result is already a 4-digit fraction; no Pad() call
}
3D Coder
  • What is it you wish to improve? If performance is not critical, the range your solution provides is OK and readability is a factor, why would you look for something else? Do you know what range and precision you'll need? If it's not large, you could consider working with a single int for simplicity. Work with it as a regular int and do a div and mod 100 (or whatever precision you need) when you need the actual value. – Vadim Aug 20 '13 at 18:19
  • Well, this is a very brute-force, non-elegant solution. Since I spent a day trying to link other libs (and that's how I found out it is impossible to link them for this specific platform) and only spent half an hour on the makeshift soft-float support, I am willing to spend an hour or more on a more robust / elegant / faster solution. – 3D Coder Aug 20 '13 at 19:08
  • Ok, but what are your goals? Why do you feel the need to code more if all your goals are met? I agree that my solution is not extensible and is very simple and thus "not elegant". But if it meets all your goals in the foreseeable future, why not use it? It does have advantages - it requires significantly less code and allows you to work with normal operators instead of functions. – Vadim Aug 20 '13 at 19:19
  • Sorry, I meant that my solution is very non-elegant and brute-force, as it was the first one I came up with. Ergo, it must be the worst one and there must be more elegant solutions out there. For me, it is the first foray into the embedded world (from a linux/win commercial environment), but thousands of coders have been working in embedded environments for decades, so they surely must have approached it somehow, when they couldn't use the switch -msoft-float (or get to link the SoftFloat lib). Can you be more specific about your approach? How do you handle overflow/underflow? – 3D Coder Aug 20 '13 at 20:19
  • This makes little sense, Motorola produced a 12.5 MHz version of 68000 in 1982, 31 years ago. It went out of production in 1996. What on Earth are you programming? It this thing real or is this a homework assignment? – Hans Passant Aug 20 '13 at 22:27
  • It's a multimedia machine, few decades old. And how on earth did you come up with the idea it could be a homework assignment ? I am truly curious. – 3D Coder Aug 21 '13 at 02:08
  • @Vadim - now I think I finally get what you meant. The whole and fraction are not separated physically (as is my first solution), but rather logically - after the mathematical operations are done. This is actually way more configurable than my early solution, since I could have different precisions by simply changing the divisor to 10, or 100, or 1000, or 10000. Which would allow to have a softfloat with small non-fractional range (say only <-3,+3> but a much higher range for fractional part. I could expose it via some API call- e.g. ChangePrecision (int FractionalDigits). Great Idea ! – 3D Coder Aug 21 '13 at 10:59
  • BTW: the linux-kernel used to have floating point emulation (don't know if it still does) You could probably borrow some of the code. Fixed point representation could also be a choice (easy to implement) – wildplasser Aug 21 '13 at 11:50
  • Well, I am pretty sure that would be impossible to pull off in my C dev env, since I don't even have libc (no malloc, nothing - just pure C) and any linux library will surely have a baggage of plenty dependencies that will be uncompileable here. – 3D Coder Aug 22 '13 at 18:33
  • note that there's no pass by reference in C, so you can't use such things like `int& b1` – phuclv Aug 30 '18 at 09:48
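Vadim's single-scaled-integer idea from the comments above can be sketched like this (the names and the scale of 10000 are my assumptions; the intermediate product in `scaled_mul` may need 64 bits for large values):

```c
#include <stdio.h>

/* the value is stored as value * 10000, so 123.4567 is kept as 1234567 */
typedef long scaled;            /* needs at least 32 bits */
#define SCALE 10000L

scaled scaled_mul(scaled a, scaled b) { return a * b / SCALE; }
scaled scaled_div(scaled a, scaled b) { return a * SCALE / b; }

/* note: the printed sign is lost for values between -1 and 0 */
void scaled_print(scaled a)
{
    printf("%ld.%04ld\n", a / SCALE, (a < 0 ? -a : a) % SCALE);
}
```

Addition and subtraction then come for free via the normal `+` and `-` operators, and changing the precision is just a matter of changing `SCALE`.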

2 Answers


You are thinking in the wrong base for a simple fixed-point implementation. It is much easier if you use binary for the fractional part as well, e.g. 16 bits for the integer part and 16 bits for the fraction (range [-32768, +32767], precision of 1/2^16, which is much finer than the 1/10000 you have).
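For concreteness, packing and unpacking such a Q16.16 value is just a shift and a mask. A sketch (the helper names are mine; negative values need extra care because the fraction bits are stored in two's complement):

```c
#include <stdint.h>

typedef int32_t q16_16;

/* 3.25 -> (3 << 16) | 0x4000, since 0.25 = 0x4000/0x10000 */
static q16_16 from_parts(int16_t whole, uint16_t frac_65536ths)
{
    return ((int32_t)whole << 16) + frac_65536ths;
}

static int16_t  whole_part(q16_16 x) { return (int16_t)(x >> 16); }
static uint16_t frac_part(q16_16 x)  { return (uint16_t)(x & 0xFFFF); }
```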

The best part is that addition and subtraction are simple: just add the two numbers as plain integers. Multiplication is a little trickier: you have to be aware of overflow, so it helps to do the multiplication in 64 bits. You also have to shift the result after the multiplication (by however many bits are in your fraction).

#include <stdint.h>

typedef int32_t fixed16;   /* Q16.16 */

fixed16 mult_f(fixed16 op1, fixed16 op2)
{
         /* widen one operand to 64 bits before multiplying, or the
          * product overflows; the compiler synthesizes the 64-bit
          * multiply for us if the CPU lacks one
          */
         int64_t tmp = ((int64_t)op1 * op2) >> 16;

          /* add in error handling for overflow if you wish - this just wraps */
         return (fixed16)tmp;
}

Division is similar.
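Division also needs the widening, but the dividend is shifted left before the divide rather than right after it. A sketch in the same style as `mult_f` (assuming the same Q16.16 typedef):

```c
#include <stdint.h>

typedef int32_t fixed16;   /* Q16.16, as in mult_f above */

fixed16 div_f(fixed16 op1, fixed16 op2)
{
    /* pre-shift the dividend into 64 bits so the quotient keeps its
     * 16 fraction bits; op2 must not be zero */
    int64_t tmp = ((int64_t)op1 << 16) / op2;

    /* wraps on overflow, like mult_f */
    return (fixed16)tmp;
}
```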

Someone might have implemented almost exactly what you need (or something that can be hacked to make it work): it's called libfixmath.

dave
  • you should cast op1 or op2 to uint64_t or it'll overflow: `tmp = ((uint64_t)op1 * op2) >> 16;` – phuclv Aug 21 '13 at 09:16
  • What exactly do you mean by 'much higher precision' ? I may be missing something, but if the fractional part has 16 bits, that means a range of <-32767,+32768>, thus 5 digits - while my first brute-force solution gives me 4 digits : <-9999,+9999>. I do, however, like a lot the idea of doing add/sub without having to handle any special cases ! That is a definite performance improvement (especially considering the awfully slow CPU (~12 MHz)). – 3D Coder Aug 22 '13 at 18:13
  • Actually, I just realized it is going to be 4-digits even in your case-since the upper limit is 32768, thus you have full range of values (0-9) only in the 4 digits - meaning the full range is <-9999,+9999>. Or I am not getting what you mean... – 3D Coder Aug 22 '13 at 18:37
  • 1
    You are still thinking in base 10. In this implementation each step is 1/2^16 (ie 1/32767) whereas you have 1/10000 for 4 "digits" of precision. I can represent a lot more numbers by using all of the digits. The number is a >> 16 + (a & 0xFFFF) / 0x10000. Note base 2 not base ten calculation! – dave Aug 22 '13 at 23:54

If you decide to use fixed-point, the whole number (i.e. both the integer and fractional parts) should be in the same base. Using binary for the integer part and decimal for the fractional part, as above, is far from optimal and slows down the calculation: with binary fixed-point you only need to shift by an appropriate amount after each operation, instead of the lengthy decimal adjustments your idea requires. If you want Q16.16, then libfixmath, as dave mentioned above, is a good choice. If you want a different precision or radix-point position, such as Q14.18 or Q19.13, then write your own library or modify an existing one for your own use.
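The "shift by an appropriate amount" step is the only thing that changes between formats: a Q14.18 multiply differs from a Q16.16 one only in the shift count. A sketch (the names and the macro are mine):

```c
#include <stdint.h>

#define FRAC_BITS 18    /* Q14.18; use 16 for Q16.16, 13 for Q19.13 */

typedef int32_t qnum;

/* multiply in 64 bits, then shift right by the number of fraction bits */
qnum q_mul(qnum a, qnum b)
{
    return (qnum)(((int64_t)a * b) >> FRAC_BITS);
}

/* addition and subtraction need no adjustment at all */
qnum q_add(qnum a, qnum b) { return a + b; }
```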

See also What's the best way to do fixed-point math?

If you want a larger range, then floating point may be the better choice. Write a library to your own requirements, and choose a format that is easy to implement and performs well in software; there is no need to follow the IEEE 754 specification (which is only fast with hardware implementations, due to the odd number of bits and the strange position of the exponent bits) unless you intend to exchange data with other devices. For example, consider a format of exp.sign.significand with 7 exponent bits, followed by a sign bit and then 24 bits of significand. The exponent doesn't need to be biased, so extracting it takes only an arithmetic shift right by 25, which sign-extends it for free. But if a shift is slower than a subtraction, then excess-n is better.
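The exponent extraction described above (an arithmetic shift right by 25 on an exp.sign.significand word) might look like this; this is a sketch of the proposed format, not IEEE, and the helper names are mine:

```c
#include <stdint.h>

/* proposed layout: [exp:7][sign:1][significand:24], exponent unbiased */
typedef int32_t softfloat;

int32_t sf_exponent(softfloat f)
{
    /* arithmetic shift right by 25: the exponent's own top bit is the
     * word's sign bit, so it comes out already sign-extended
     * (">>" on negative signed values is implementation-defined in C,
     * but is an arithmetic shift on typical 68k compilers) */
    return f >> 25;
}

int32_t  sf_sign(softfloat f)        { return (f >> 24) & 1; }
uint32_t sf_significand(softfloat f) { return (uint32_t)f & 0xFFFFFF; }
```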

phuclv