How to implement multiplication in fixed-point

Question

I want to implement multiplication in C and in fixed-point with scalable fraction bitwidth (i.e. from 1 bit to 30 bit), I know the simplest way is like this:

typedef int32_t fixedpt;
typedef int64_t fixedptd;
fixedpt fixedpt_mul(fixedpt A, fixedpt B)
{
    return (((fixedptd)A * (fixedptd)B) >> FIXEDPT_FBITS);
}

but let's suppose that I can not use int64_t, so this method is easy to overflow when the fractional bitwidth is large. I found an existing repo: libfixmath that separates the integer parts and fraction parts before multiplication, this method alleviates the above issue, but it only implements the 16 fraction bitwidth, so I modify it to fit more general cases:

#include <stdint.h>
#include <stdio.h>

typedef int32_t fix_t;
// w means fraction bitwidth
#define fix_ONE(w)      ((fix_t)(0x00000001 << w))
#define MASK_M(w)       (((fix_t)1 << w) - 1) // mask of mantissa, all LSB set, all MSB clear
#define fix_rconst(x, w) ((fix_t)((x) * fix_ONE(w) + ((x) >= 0 ? 0.5 : -0.5)))  // convert a const to rounded fix_t
static const fix_t fix_overflow = 0x80000000; // 1000 0000 0000 0000 0000 0000 0000 0000
static inline float fix_to_float(fix_t a, int w) { return ((float)a / fix_ONE(w)); }

fix_t fix_mul(fix_t inArg0, fix_t inArg1, int w) {
    // w is fractional bitwidth
    // separate interger part and fraciton part
    int32_t A = (inArg0 >> w), C = (inArg1 >> w);
    uint32_t B = (inArg0 & MASK_M(w)), D = (inArg1 & MASK_M(w));
    
    int32_t AC = A*C;
    int32_t AD_CB = A*D + C*B;
    uint32_t BD = B*D;
    
    int32_t product_hi = AC + (AD_CB >> w); // product_hi stands for the interger part of final result
    
    uint32_t ad_cb_temp = AD_CB << (32-w); // get the fraction part of AD_CB
    uint32_t product_lo = BD + ad_cb_temp;  //product_lo stands for the fraction part of final result

    if (product_lo < BD || product_lo < ad_cb_temp) { // check if product_lo overflow
        product_hi++;
    } 
    
    // The upper part bits should all be the same (including the sign).
    if (product_hi >> 31 != product_hi >> (31-w)) {
        printf("Overflow in fix_mul(), please use other bitwidth \n");
        return fix_overflow;
    }
    // combine interger part and fraction part
    return (product_hi << (w)) | (product_lo >> (32-w));
}

int test_mul(void)
{
    // test cases
    float a = 0.267f; //0.50f;//1.267f; //-1.267f;//-1.267f;
    float b = 0.849f; //0.25f;//1.849f; //1.849f; //-1.849f;
    for (int w = 1; w < 28; w++) {
        fix_t aa = fix_rconst(a, w);
        fix_t bb = fix_rconst(b, w);
        fix_t result = fix_mul(aa, bb, w);
        float fresult = fix_to_float(result, w);
        printf("fix_rconst(%f, %i) = %i, fix_rconst(%f, %i) = %i, result = %i, fresult=%f \n", a, w, aa, b, w, bb, result, fresult);
    }
    return 0;
}

int main()
{
    test_mul();
    // system("pause"); 
    return 0;
}

You can the code online here. But the test results are not correct except the 16 bitwidth, the expected result is around 0.267*0.849=0.226683, it is acceptable there is a minor error for small fractional bitwidth, i.e. lower precision:

fix_rconst(0.267000, 1) = 1, fix_rconst(0.849000, 1) = 2, result = 1, fresult=0.500000 
fix_rconst(0.267000, 2) = 1, fix_rconst(0.849000, 2) = 3, result = 0, fresult=0.000000 
fix_rconst(0.267000, 3) = 2, fix_rconst(0.849000, 3) = 7, result = 0, fresult=0.000000 
fix_rconst(0.267000, 4) = 4, fix_rconst(0.849000, 4) = 14, result = 0, fresult=0.000000 
fix_rconst(0.267000, 5) = 9, fix_rconst(0.849000, 5) = 27, result = 0, fresult=0.000000 
fix_rconst(0.267000, 6) = 17, fix_rconst(0.849000, 6) = 54, result = 0, fresult=0.000000 
fix_rconst(0.267000, 7) = 34, fix_rconst(0.849000, 7) = 109, result = 0, fresult=0.000000 
fix_rconst(0.267000, 8) = 68, fix_rconst(0.849000, 8) = 217, result = 0, fresult=0.000000 
fix_rconst(0.267000, 9) = 137, fix_rconst(0.849000, 9) = 435, result = 0, fresult=0.000000 
fix_rconst(0.267000, 10) = 273, fix_rconst(0.849000, 10) = 869, result = 0, fresult=0.000000 
fix_rconst(0.267000, 11) = 547, fix_rconst(0.849000, 11) = 1739, result = 0, fresult=0.000000 
fix_rconst(0.267000, 12) = 1094, fix_rconst(0.849000, 12) = 3478, result = 3, fresult=0.000732 
fix_rconst(0.267000, 13) = 2187, fix_rconst(0.849000, 13) = 6955, result = 29, fresult=0.003540 
fix_rconst(0.267000, 14) = 4375, fix_rconst(0.849000, 14) = 13910, result = 232, fresult=0.014160 
fix_rconst(0.267000, 15) = 8749, fix_rconst(0.849000, 15) = 27820, result = 1856, fresult=0.056641 
fix_rconst(0.267000, 16) = 17498, fix_rconst(0.849000, 16) = 55640, result = 14855, fresult=0.226669 
fix_rconst(0.267000, 17) = 34996, fix_rconst(0.849000, 17) = 111280, result = 118846, fresult=0.906723 
fix_rconst(0.267000, 18) = 69992, fix_rconst(0.849000, 18) = 222560, result = 164338, fresult=0.626900 
fix_rconst(0.267000, 19) = 139985, fix_rconst(0.849000, 19) = 445121, result = 266201, fresult=0.507738 
fix_rconst(0.267000, 20) = 279970, fix_rconst(0.849000, 20) = 890241, result = 32390, fresult=0.030890 
fix_rconst(0.267000, 21) = 559940, fix_rconst(0.849000, 21) = 1780482, result = 259120, fresult=0.123558 
fix_rconst(0.267000, 22) = 1119879, fix_rconst(0.849000, 22) = 3560964, result = 2069485, fresult=0.493404 
fix_rconst(0.267000, 23) = 2239758, fix_rconst(0.849000, 23) = 7121928, result = 8167272, fresult=0.973615 
fix_rconst(0.267000, 24) = 4479517, fix_rconst(0.849000, 24) = 14243856, result = 15062169, fresult=0.897775 
fix_rconst(0.267000, 25) = 8959033, fix_rconst(0.849000, 25) = 28487712, result = 19611502, fresult=0.584468 
fix_rconst(0.267000, 26) = 17918066, fix_rconst(0.849000, 26) = 56975424, result = 22674290, fresult=0.337873 
fix_rconst(0.267000, 27) = 35836132, fix_rconst(0.849000, 27) = 113950848, result = 47176592, fresult=0.351493

---------------------- UPDATE --------------------------

I found a tempory workaround. The error comes from the fractional part, i.e. B and D. The workaround is inserting following code:


    uint32_t B = (inArg0 & MASK_M(w)), D = (inArg1 & MASK_M(w));

    // Inserted code here
    if(w < 16){
        B <<= (16-w);
        D <<= (16-w);
    }else{
        B >>= (w-16);
        D >>= (w-16);
    }

    int32_t AC = A*C;

but this workaround doesn't work if a = 5.567, b = 2.7835.

The simplest solution is to do the multiplication using `int64_t`. — user3386109, Mar 13 '23 at 06:41
@user3386109 Unfortunately, I was limited to run the code on 32 bit OS. — Wade Wang, Mar 13 '23 at 06:45
You can use `int64_t` on a 32 bit OS also, and on 32 bit or even 8 bit processors. — jpa, Mar 13 '23 at 06:47
I just konw this point, I found a related question here: https://stackoverflow.com/questions/5530906/do-i-need-to-have-64-bit-processor-to-use-64-bit-data-type. Sorry for my wrong expression, let's assume that I am limited with 32 bit compiler. — Wade Wang, Mar 13 '23 at 07:44
@Wade Wang, `#define fix_ONE(w) ((fix_t)(0x00000001 << w))` is wrong when `int` is 16-bit. Should be `#define fix_ONE(w) (((fix_t)1) << (w)))`. — chux - Reinstate Monica, Mar 13 '23 at 08:47
@Wade Wang, Rounding is problematic too. Fails corner cases. Please describe your desired rounding. (Using FP math is the wrong approach here.) — chux - Reinstate Monica, Mar 13 '23 at 08:50
@WadeWang "test results are not correct except the 16 bitwidth:" --> Post the expected results. Simply saying "not correct" is insufficient. — chux - Reinstate Monica, Mar 13 '23 at 08:57
@chux-ReinstateMonica Done. As for roungding, for simplicity, here I didn't consider rounding, but the libfixmath has rounding implementation : https://github.com/PetteriAimonen/libfixmath/blob/797e13120b6600c76a0d00dc0ab774af2d97701d/libfixmath/fix16.c#L149-L166 — Wade Wang, Mar 13 '23 at 09:24
`uint32_t ad_cb_temp = AD_CB << (32-w);` looks dodgy compared to the original `uint32_t ad_cb_temp = AD_CB << 16;`. Consider reviewing. — chux - Reinstate Monica, Mar 13 '23 at 09:35

jpa · Answer 1 · 2023-03-13T07:31:46.007

1

// w is fractional bitwidth
// separate interger part and fraciton part
int32_t A = (inArg0 >> w), C = (inArg1 >> w);
uint32_t B = (inArg0 & MASK_M(w)), D = (inArg1 & MASK_M(w));

The point of this code is to divide the 32-bit values into two parts. It is only coincidental that in libfixmath also the fractional bit count is 16. Even if you use a different fixed point location, the actual multiplication can remain identical.

Here is an adapted image from some earlier work of mine. The shift amount w only affects the final result usage, not the A/B/C/D values. Note that when w is not 16, the result will not be neatly aligned to word boundaries and has to be shifted.

edited Mar 13 '23 at 07:31

answered Mar 13 '23 at 06:55

jpa

10,351
1
28
45

It seems that the shift amount `w` will affect the B D value, for example, `a = 1.267f, b = 1.849f`, if `w=14`, then `inArg0=20759, inArg1=30294, B=4375, D=13910`. If `w = 18`, then `inArg0=332136, inArg1=484704, B=69992, D=222560`. The difference of B and D leads to the difference of AD_CB and BD. – Wade Wang Mar 13 '23 at 09:08
@WadeWang It's just integer multiplication. The remaining offset in result, including in the B D subcomponent, will depend on `w`. But it's best to shift that out as the last step, not at the beginning: each component has to be max 16 bits if the hardware multiplication result is limited to 32 bit. – jpa Mar 13 '23 at 09:48
What would the illustration looks like if `w` ≠ 16 ？ – Wade Wang Mar 14 '23 at 03:28
@WadeWang Only the bottom row changes, by the width of the discarded bottom bits. – jpa Mar 14 '23 at 06:09

score 0 · Answer 2 · answered Mar 13 '23 at 07:22

0

If thes type fixedpt and fixedptd are respectively int32_t and int64_t, the simplistic implementation works for all fraction bit widths except maybe for the rounding method. You might want to try this variation:

fixedpt fixedpt_mul(fixedpt A, fixedpt B)
{
    return (((fixedptd)A * (fixedptd)B + (1 << (FIXEDPT_FBITS - 1))) >> FIXEDPT_FBITS);
}

The C Standard mandates type long long to have at least 63 value bits so the above should be supported by the target if it complies with even an old version of the Standard.

If you do not want to use 64-bit multiplication, you can use the 16x16 -> 32 multiplication from the posted code to emulate it and then round and shift the final result right by FIXEDPT_FBITS bits, which will require some adjustments depending on whether FIXEDPT_FBITS >= 16 or not.

answered Mar 13 '23 at 07:22

chqrlie

131,814
10
121
189

It seems even though 32 bit CPU support `int64_t`, but it is less efficient, so I am curious if we can implement multiplication without `int64_t`, just with `int32_t` and `uint32_t`. – Wade Wang Mar 13 '23 at 08:15
1

If the CPU supports it in hardware, it is quite unlikely to be less efficient than a software approach. If it is emulated with inline code or in the support library, testing will show if you can beat the compiler guys, but I am not sure the code from the question is correct for negative arguments and the rounding is also questionable. – chqrlie Mar 13 '23 at 08:42
@WadeWang "even though 32 bit CPU support int64_t, but it is less efficient" --> No. Best to assume the compiler `int64_t` code is more efficient than your non-`int64_t` crafted code - unless you have evidence to the contrary. – chux - Reinstate Monica Mar 13 '23 at 09:01

score 0 · Answer 3 · answered Mar 16 '23 at 17:08

in order to implement fixed point decimal computations, let's say of N decimal digits after decimal point you can consider that the numbers are just plain integer numbers (with the unit at the tiniest subinteger decimal number) when you add/subtract two numbers, you have no adjustment to do, whenever you multiply two numbers you have to multiply each result by 10^N (as both numbers contribute, each, with a 10^N power, you have to divide by one of those, to make the decimal point to be at the proper place). This will make you to have to multiply the result by 10^N on division (the two powers of ten cancel each other on division, so you have to provide one factor to set the decimal point at the proper place) and, on printing, you just have to put one decimal point N places before the last decimal digit (even being it a zero)

Conclussion, use plain integers, taking care on multiplications to divide the result (after the multiplication) by 10^N, and multiply the quotient of a division by 10^N. Of course, as the unit of measure is now 10^n times smaller, you will have easier overflows, so always use full 64bit integers.

You can, in case you will need values greater than what 64bit integer arithmetic can give you, to use a multiprecision library like gmp or similar (all multiprecision libraries handle fixed point decimal types, normally)

How to implement multiplication in fixed-point

3 Answers3