Polynomial Fitting using GNU Scientific C

Question

I need to get a n-degree function from n+1 data points. By using the following gnuplot script I get the right fitting:

f(x) = a + b*x + c*x**2 + d*x**3 + e*x**4 + f*x**5 + g*x**6 + h*x**7 + i*x**8 + j*x**9 + k*x**10 + l*x**11 + m*x**12

# Initial values for parameters
a = 0.1
b = 0.1
c = 0.1
d = 0.1
e = 0.1
f = 0.1
g = 0.1
h = 0.1
i = 0.1
j = 0.1
k = 0.1
l = 0.1
m = 0.1


# Fit f to the following data by modifying the variables a, b, c
fit f(x) '-' via a, b, c, d, e, f, g, h, i, j, k, l, m
   4.877263 45.036000
   4.794907 44.421000
   4.703827 43.808000
   4.618065 43.251000
   4.530520 42.634000
   4.443111 42.070000
   4.357077 41.485000
   4.274298 40.913000
   4.188404 40.335000
   4.109381 39.795000
   4.027594 39.201000
   3.946413 38.650000
   3.874360 38.085000
e

After fitting, I got the following coefficients:

a               = -781956        
b               = -2.52463e+06   
c               = 2.75682e+06    
d               = -553791        
e               = 693880         
f               = -1.51285e+06   
g               = 1.21157e+06    
h               = -522243        
i               = 138121         
j               = -23268.8       
k               = 2450.79        
l               = -147.834       
m               = 3.91268

Then, by plotting data and f(x) together it seems that the given coefficients are right: Here one can see the function fitted over the data points.

However, I need to get such fitting by using c code. For several cases, the GNU Scientific Library code for polynomial fit (as in this link) resulted right. But for the above data (and several other cases in my data set) the result I got was flawed.

For instance, the following code (which uses the same data from the e.g. above):

void testOfPolynomialFit(){
   double x[13] = {4.877263, 4.794907, 4.703827, 4.618065, 4.530520, 4.443111, 4.357077, 4.274298, 4.188404, 4.109381, 4.027594, 3.946413, 3.874360};
   double y[13] = {45.036000, 44.421000, 43.808000, 43.251000, 42.634000, 42.070000, 41.485000, 40.913000, 40.335000, 39.795000, 39.201000, 38.650000, 38.085000};
   double coefficients[13];

   polynomialfit(13, 13, x, y, coefficients);

   int i, n = 13;

   for (i = 0; i < n; i++)
   {
    printf("%lf\t", coefficients[i]); 
   }
   printf("\n"); 

}

Results in:

-6817581083.803348      12796304366.105989      -9942834843.404181      3892080279.353104             
 -630964566.517794        -75914607.005088         49505072.518952        -5062100.000931 
   -1426228.491628           514259.312320           -70903.844354            4852.824607
       -136.738756

Which corresponds to a function in the form:

c(x)=-6837615134.799868+12834646330.586414*x**1-9973474377.668280*x**2+3904659818.834625*x**3-633282611.288889*x**4-76066283.747942*x**5+49670960.939126*x**6-5091123.449217*x**7-1426628.818192*x**8+515175.778491*x**9-71055.177018*x**10+4863.969973*x**11-137.065848*x**12

One can check how c(x) looks like here:

c(x) looks

In such image, a(x) and b(x) are functions fitted using "polynomialfit" for just a few of the points (4 and 7).

So, any tips on what I'm doing wrong here? There is some other c code which provides the right fitting?

My maths may be rusty, but for polynomial fitting to work, wouldn't you typically want the number of polynomial coefficients to be less than the number of data points? You can always perfectly fit an `n` degree polynomial to any `n` data points, and if only through a rounding error you are likely to get some odd results. — abligh, Jan 07 '15 at 18:26
Actually, @abligh, you can perfectly fit an `n-1` degree polynomial to any `n` points. — Degustaf, Jan 07 '15 at 18:30
Sure code does not need 14 points for a unique 13th order polynomial? — chux - Reinstate Monica, Jan 07 '15 at 18:30
@abligh: Given n+1 points, there is one and only one n degree polynomial that fits such points. As can mentioned in proof one here [link](), there is no way to solve the polynomial for less than n+1 datapoints. — Isma, Jan 07 '15 at 18:32
I assumed OP was trying to best-fit a lower order polynomial rather than perfect-fit a maximum degree polynomial. — abligh, Jan 07 '15 at 18:33
@chux looking at the source code in the link, you need to pass the number of coefficients, not the degree of the polynomial, despite what they decided to call the variable. — Degustaf, Jan 07 '15 at 18:34
@chux, I have 13 points and, as far as I understood the polynomialfit code, i'm using an 12th order polynomial. (even so I use 13 for the number of points and also for the degree parameter, but if I change the parameter for 12, I get only 12 coefficients. Therefore I understood that the degree in the code accounts for the "error" variable, which is "a" in the gnuplot fit.) — Isma, Jan 07 '15 at 18:37
Hmmm, Wonder how the results look if the degree is 2, 3, 4 .... Does the trouble occur at 13 or show signs with a lesser degree? — chux - Reinstate Monica, Jan 07 '15 at 18:38
There is a degree-12 polynomial which exactly fits, but it's probably not a sensible model of whatever real-world process these numbers came from. The points look very close to a linear relationship. Your big polynomial wiggles around to hit all the points exactly, but as soon as you leave the range `[3.874360:4.877263]` it bends wildly away from the line they're all approximately on. Try `plot [3.5:5.5] f(x), '-'` after your `fit` command to see what I mean — , Jan 07 '15 at 18:52
@chux, I tryed for 2, 4, 6, 8, 10 and 12 coefficients and 12 is the first one I got problemns with. [See here](https://www.dropbox.com/s/sa7btjfjbzb54q9/chartD.png?dl=0) — Isma, Jan 07 '15 at 19:00
@WumpusQ.Wumbley, I'm aware that it is really close to a linear relationship. Actually, I'm trying to relate some errors of one given model with r² of a linear fit, once not all my datapoints are so close to a linear relationship. — Isma, Jan 07 '15 at 19:06
Furthermore notice that changing the initial values from `0.1` to something else - or even changing just `m` to an initial value of `0.10001` - causes gnuplot to end up with a very different answer. Since there's only one exact-fit polynomial we can conclude that gnuplot isn't actually finding it. It's just finding one that happens to come close enough at the specified points to look like a match within the limits of the floating point type. — , Jan 07 '15 at 19:06
@WumpusQ.Wumbley, I modified the initial values as you mentioned and yes, the results of gnuplot are quite different ones. This, as mentioned, expected by the limitation in float or even double representations. However, even different results for each try, the found polynomials are close to the fitted data within acceptable errors. There should be some C code (even the one of gnuplot) which can provide the same kind of result. But the one in GSL is very different and, as mentioned, it is probably caused by and error not accounted for in the polynomialfit code. — Isma, Jan 07 '15 at 19:20

Degustaf · Answer 1 · 2015-01-07T19:29:03.860

2

You are suffering from numerical instability. A simple linear regression confirms what can visually be observed: 99.98% of the variation can be accounted for by using just a linear model.

The code in the link you provided does a number of very unsafe things: not checking that obs or degree are positive, not checking that the memory allocation was successful, not returning anything useful, ...

I would assume that gsl_multifit_linear overflows, or cannot contain the numerical instability, and not checking the return means that we just don't know.

Edit:

According to the GSL Website polynomial regression can lead to extra numerical instabilities due to computing large powers of large numbers. Try preprocessing your x values with x = (x - avg_x) / sd_x. This should allow you to get a couple more degrees of your polynomial before this happens.

In the long run, you are likely to hit this issue again. If you are perfomrnig this analysis with 35 or 100 or more data points it is unlikely that you will be able to find any technique to overcome the instability.

edited Jan 07 '15 at 19:29

answered Jan 07 '15 at 18:52

Degustaf

2,655
2
16
27

I'd expect the code *inside* the gsl library is a little bit higher quality than the code *calling* the gsl library that was written by some rosettacode enthusiast... – Jan 07 '15 at 18:55
Thank you @WumpusQ.Wumbley I didn't look closely enough at the link. I assumed the function was part of the GSL library. Edited to fix this oversight – Degustaf Jan 07 '15 at 19:03
@WumpusQ.Wumbley I'd expect the same for the code inside the gsl library. Can't say the same about the reosettacode one, but I also does't know where is the instability. – Isma Jan 07 '15 at 19:07
@Degustaf, I thought that it could be caused by numerical instability. I tried another unsafe code found around internet, but also got weird results for larger polynomials. However I could explain the variation by a linear model, I need the n-degree polynomial to follow a model and further explain as the model is less susceptible to errors when the data presents itself close to a linear model. – Isma Jan 07 '15 at 19:12
@Isma The instability is coming from your matrix `[x:x^2:x^3:...]`. Because the data fits a linear model so well, the condition number will be huge by the time you try to fit a cubic polynomial, and only get worse from there. – Degustaf Jan 07 '15 at 19:15
I see, however I got a close enough result with gnuplot, which I could use. But as I need to use the coefficients further, it will be useful to have some C code to get them. One workaround would be using gnuplot to fit all my data and then read the files using my c program. By it would be important to have all the functionality within one program only. – Isma Jan 07 '15 at 19:25
[Here is](https://www.dropbox.com/s/dev1z2l752uiknf/chartE.png?dl=0) one example of my dataset were the data is more far from a linear model. – Isma Jan 07 '15 at 19:35

score 2 · Accepted Answer · answered Jan 07 '15 at 19:33

One major difference between your two solution methods is that when you use gnuplot, you're doing a fit to set the coefficients and the plotting the function from there in the same program, whereas with GSL you're copying the numbers from one program to another.

If you're using the output of printf("%lf", ...) as input for your second gnuplot program, you've lost a lot of accuracy because printf is rounding the numbers way more than any of the internal operations of either program. And because this a numerically unstable problem, a little bit of rounding hurts a lot.

When x is 4.877263, x**12 is approximately 181181603.850932 so if your m is off by 0.000001 (printf's default rounding level) that introduces an error of 181.181603850932 which is a relative error of about 300% over the actual y value for that x.

Try %.60lf and see if it gets better.

If one of the programs is using long double internally and the other one isn't, then you're probably not going to get a good match no matter what you do.

Thank you very much. Using %.60lf really worked. The fit was right, only when plotting it I got a bad result. — Isma, Jan 07 '15 at 19:40
Rather than `f` in `%.60lf` suggest `e`. See [Printf width specifier to maintain precision](http://stackoverflow.com/questions/16839658/printf-width-specifier-to-maintain-precision-of-floating-point-value/19897395) for details. — chux - Reinstate Monica, Jan 07 '15 at 19:55
Really I'd prefer `%a` but gnuplot didn't like it when I tried that — , Jan 07 '15 at 19:58

rcgldr · Answer 3 · 2015-01-07T22:18:59.237

I'm a bit confused between the problem statement and the example code. The polynomialfit function would normally expect >= n + 2 data points to fit a polynomial of degree n. In the case of n+1 data points, instead of doing a fit, you generate an exact solution (if floating point rounding wasn't an issue) by generating a matrix with n+1 rows and columns with the given n+1 set of values to represent the set of linear equations:

| x[0]^n + x[0]^(n-1) + ... + x[0] + 1 |  | c[  n] |    | y[0] |
| x[1]^n + x[1]^(n-1) + ... + x[1] + 1 |  | c[n-1] |    | y[1] |
|  ...                                               =  | ...  |
| x[n]^n + x[n]^(n-1) + ... + x[n] + 1 |  | c[  0] |    | y[n] |

So only the c[ ]'s are variables, and the equations are linear. Invert the matrix of the x values then multiply the y values by the inverted matrix to produce the result. There may be issues if the actual polynomial is of lower degree than n (two or more of the equations will not be linearly independent). If this happens, you can use one or more less rows or switch to using a conventional polynomial fit algorithm.

If there are >= n+2 data points, one option is a least squares type polynomial fit. Here is a link to a .rtf document that uses orthogonal and recursively defined polynomials to fit a set of data points. The orthogonal polynomials eliminate the need to invert a matrix, so it can handle polynomials of a greater degree more accurately.

opls.rtf

Example runs with degree = 1 and degree = 3. First column is original x value, second column is original y value, third column is calculated y value, 4th column is ( original y value - calculated y value ) :

variance =  9.6720e-004
 6.8488e+000 X**1 +  1.1619e+001

 4.877263e+000   4.503600e+001   4.502243e+001   1.356604e-002
 4.794907e+000   4.442100e+001   4.445839e+001  -3.739271e-002
 4.703827e+000   4.380800e+001   4.383460e+001  -2.660237e-002
 4.618065e+000   4.325100e+001   4.324723e+001   3.765950e-003
 4.530520e+000   4.263400e+001   4.264765e+001  -1.365429e-002
 4.443111e+000   4.207000e+001   4.204901e+001   2.099404e-002
 4.357077e+000   4.148500e+001   4.145977e+001   2.522524e-002
 4.274298e+000   4.091300e+001   4.089284e+001   2.016354e-002
 4.188404e+000   4.033500e+001   4.030456e+001   3.043591e-002
 4.109381e+000   3.979500e+001   3.976335e+001   3.165004e-002
 4.027594e+000   3.920100e+001   3.920321e+001  -2.205684e-003
 3.946413e+000   3.865000e+001   3.864721e+001   2.788204e-003
 3.874360e+000   3.808500e+001   3.815373e+001  -6.873392e-002

variance =  2.4281e-004
 8.0952e-001 X**3 + -1.0822e+001 X**2 +  5.4910e+001 X**1 + -5.9287e+001

 4.877263e+000   4.503600e+001   4.502276e+001   1.324045e-002
 4.794907e+000   4.442100e+001   4.444280e+001  -2.180419e-002
 4.703827e+000   4.380800e+001   4.381431e+001  -6.306292e-003
 4.618065e+000   4.325100e+001   4.323170e+001   1.929905e-002
 4.530520e+000   4.263400e+001   4.264294e+001  -8.935141e-003
 4.443111e+000   4.207000e+001   4.205786e+001   1.214442e-002
 4.357077e+000   4.148500e+001   4.148153e+001   3.468503e-003
 4.274298e+000   4.091300e+001   4.092369e+001  -1.069376e-002
 4.188404e+000   4.033500e+001   4.033844e+001  -3.436876e-003
 4.109381e+000   3.979500e+001   3.979160e+001   3.397859e-003
 4.027594e+000   3.920100e+001   3.921454e+001  -1.354191e-002
 3.946413e+000   3.865000e+001   3.862800e+001   2.199866e-002
 3.874360e+000   3.808500e+001   3.809383e+001  -8.830768e-003

Polynomial Fitting using GNU Scientific C

3 Answers3