
I would like to evaluate a polynomial on a real-time embedded system that has only 32-bit integer hardware. For this reason, I am trying to use fixed-point arithmetic. How can I avoid overflows without also imposing ridiculous limitations on the parameters?

Suppose I have the coefficients a,b,c,d and I want to evaluate

ax^3 + bx^2 + cx + d

for a certain range of x.

Suppose the coefficients a,b,c,d and the range for x can be computed off-line and can be scaled to make whatever method I use to evaluate the polynomial work.

What can I do to avoid overflows but still have about 20 bits of precision in the result? If I do nothing, then even for small values of x (like 10,000) x^3 is 1,000,000,000,000 which won't fit in 32 bits.

To give an example, suppose I want to evaluate the polynomial

F(x) = ax^3

For x in range x=<0.0,1.0>. I want F(0.0) = 0.0 and F(1.0) = 100.0. But I also want the value of this function at 10,000 points in that range, so F(0.0001), F(0.0002) etc.

If I want the result of F(x) to always be accurate to the nearest integer, how should I evaluate F(x) using only 32 bit integer math?

Spektre
Filipp
  • You need to define the "certain range of x", i.e. max, min, and required resolution. – Paul R Mar 18 '16 at 15:45
  • Possible duplicate of [How to detect integer overflow in C/C++?](http://stackoverflow.com/questions/199333/how-to-detect-integer-overflow-in-c-c) – xvan Mar 18 '16 at 15:45
  • I'm not very sure that you need fixed point arithmetic, maybe you meant long-math? – Iłya Bursov Mar 18 '16 at 15:46
  • Are you sure that your platform is not able to manage uint64_t variables? What is your platform? What is your compiler and OS? – LPs Mar 18 '16 at 15:56
  • BTW take a look at libtommath of [libtomcrypt](https://github.com/libtom/libtomcrypt) – LPs Mar 18 '16 at 16:11
  • Where do you want the precision to be, to the left or right of the point? – user3528438 Mar 18 '16 at 16:34
  • There's no compiler and OS. I'm writing assembly code for an embedded chip. – Filipp Mar 18 '16 at 16:53
  • You might have to write your own floating point arithmetic. Even if you did have compiler implementation of `float` they can be so inefficient it is necessary anyway. – Weather Vane Mar 18 '16 at 17:09
  • Another reason why I want 32 bit integer math is because this code has very tight real time constraints. I have to count the number of cycles each instruction takes. – Filipp Mar 18 '16 at 17:14
  • Need to define the possible range of values for `a,b,c,d`. – chux - Reinstate Monica Mar 18 '16 at 21:53
  • Typically you evaluate this as `c0+x*(c1+x*(c2+x*c3))`. For fixed-point arithmetic with `q=1.0`, use `c0+(x*(c1+(x*(c2+(x*c3)/q))/q))/q` instead. The temporaries are always double wide; how to accomplish that depends on the architecture. (Many processors support 32+32=64bit multiply, even if they do not support 64-bit arithmetic.) If your architecture has features that allow a fast fused-multiply-add function for fixed-point or for an emulated floating point type, then use `fma(x, fmaf(x, fmaf(x, c3, c2), c1), c0)`. (`fma(a,b,c) = a * b + c`.) – Nominal Animal Mar 21 '16 at 15:01

4 Answers


Solve ax^3 + bx^2 + cx + d = MAXINT - 1 for x.

This gives the maximum input value that will not overflow.

Then scale your input values accordingly. Evaluate the polynomial, then reverse the scaling.

cwallach

Maybe use some kind of scientific notation? 1,000,000,000,000 is 1.0 x 10^12, so you could use 20 bits to store the mantissa and the remaining 12 bits for the exponent or for extra precision.

Dave the Sax

Define and implement fixed-point data types and arithmetic operations.

Before the calculation, convert all real numbers to fixed-point format and operate on them.

Also make the consumer accept fixed-point data, or convert the fixed-point results back to integer or float.

The Q value, a.k.a. the scaling factor, is a design choice that trades off precision against range.

#include <stdio.h>
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fractional bits */
typedef int32_t Q16;

Q16 floatToQ16(float in)
{
    return in*(1<<16);              /* scale by 2^16 */
}
float q16ToFloat(Q16 in)
{
    return (float)in/(1<<16);
}
Q16 q16mul(Q16 a, Q16 b)
{
    /* widen the intermediate product to 64 bits, then drop 16 fraction bits */
    return (int64_t)a*b>>16;
}
/* evaluates a*x^3 + b*x^2 + c*x */
Q16 q16pol(Q16 a, Q16 b, Q16 c, Q16 x)
{
    return q16mul(a, q16mul(q16mul(x, x), x))
          + q16mul(b, q16mul(x, x))
          + q16mul(c, x);
}
int main(void) {
    printf("2*20.123^3+3*20.123^2+4*20.123=%f\n",
            q16ToFloat(q16pol(floatToQ16(2.0f),
                              floatToQ16(3.0f),
                              floatToQ16(4.0f),
                              floatToQ16(20.123f))));
    return 0;
}

Demo: http://ideone.com/5vVNb8

Reference:

GCC's fixed-point arithmetic support

N1169 draft: fixed-point support in the C programming language

user3528438
  • That assumes no value or intermediate result can be larger than 2^15 or smaller than 2^-16 in magnitude. What happens when I need to exceed those limits? – Filipp Mar 18 '16 at 18:12
  • @Filipp Decrease `Q`. – user3528438 Mar 18 '16 at 18:15
  • Great! Now my limits are 2^16 to 2^-15 and I'm left with an equivalent problem. – Filipp Mar 18 '16 at 18:16
  • If you need the point floating then you need floating point for sure. Fixed point can only provide a fixed range and precision. And I said the choice of `Q` is a design decision that needs to be custom fitted to the required precision and range. However, in practice the Q value can often be dynamic. If you can implement a good estimate of the required `Q` (for example, estimate log2(x) by finding where the most significant 1 of the bit pattern is), you can have a more flexible/adaptive Q format. – user3528438 Mar 18 '16 at 18:24
  1. fractional bits

    Your example tells us that x=<0.0,1.0> with at least 10000 steps, so you need to represent min_dx=1.0/10000.0=0.0001.

    Beware that a decimal fractional part is not exactly representable in binary, so if you use 10000 steps both for computing the fractional bits and as the number of steps, the spacing between points will be nonlinear (oscillating around 0.0001). To get more precision, add one or a few more bits.

    fract_bits=ceil(log2(10000))=14;
    min_dx=1/(2^14)=1/16384=0.00006103515625
    

    To be a bit safer, use 15 fractional bits instead, so

    min_dx=1/(2^15)=1/32768=0.000030517578125
    
  2. integer bits

    Assuming two's complement and signed integer arithmetic, we are left with:

    int_bits=32-fract_bits=32-15=17
    absmax=2^(int_bits-1)=2^16=65536
    

    So if no sub-result of your polynomial expression exceeds the range <-65536,65536) for any valid x, then you can use fixed-point arithmetic in 17.15 format.

  3. multiple fixed point formats

    If any sub-result of your polynomial expression exceeds the range <-65536,65536) for some valid x, then you need to use more than one fixed-point format. For the range x=<0.0,1.0> this can occur only if any of |a|,|b|,|c|,|d| is bigger than absmax/4 (each of the four terms is bounded by its coefficient's magnitude when |x|<=1). Since you state |result|<=100.0, this should not normally happen.

    Anyway, if your x range were x=<0.0,2.0>, then x^0,x^1,x^2,x^3 would have different ranges and the sub-results could overflow more easily. In such cases you can use fewer fract_bits for the higher powers of x (a different fixed-point format per polynomial term) and at the end convert all of them to a single output format (the one with the least number of fractional bits).

    If lowering fract_bits is not possible because of the precision cost, then you need to use 64-bit (or wider) arithmetic instead (which can be emulated with 32-bit operations).

Spektre