9

I have a lot of x-y data points with errors on y that I need to fit non-linear functions to. Those functions can be linear in some cases, but are more usually exponential decay, gauss curves and so on. SciPy supports this kind of fitting with scipy.optimize.curve_fit, and I can also specify the weight of each point. This gives me weighted non-linear fitting which is great. From the results, I can extract the parameters and their respective errors.

There is just one caveat: The errors are only used as relative weights; they do not propagate into the parameter uncertainties. If I double the errors on all of my data points, I would expect the uncertainty of the result to increase as well. So I built a test case (source code) to check this.
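To make the comparison concrete, the test does something along these lines (a minimal sketch, not the linked source; the straight-line model, the true parameters (2, 3) and the error size are assumptions based on the numbers below, which come from the actual script):

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # stand-in model; my real functions are usually exponential decays or Gaussians
    return a * x + b

np.random.seed(0)
x = np.linspace(0, 10, 50)
y_err = 0.1 * np.ones_like(x)                      # known error on every y value
y = model(x, 2.0, 3.0) + y_err * np.random.randn(x.size)

for scale in (1, 2):                               # fit once with y_err, once with 2 * y_err
    popt, pcov = curve_fit(model, x, y, sigma=scale * y_err)
    print('Parameters:', popt)
    print('Errors:    ', np.sqrt(np.diag(pcov)))   # identical for both scales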

Fit with scipy.optimize.curve_fit gives me:

Parameters: [ 1.99900756  2.99695535]
Errors:     [ 0.00424833  0.00943236]

Same but with 2 * y_err:

Parameters: [ 1.99900756  2.99695535]
Errors:     [ 0.00424833  0.00943236]


So you can see that the values are identical. This tells me that the algorithm does not take the magnitude of the errors into account, but I think the values should be different.

I also read about another fit method here, so I tried scipy.odr:

Beta: [ 2.00538124  2.95000413]
Beta Std Error: [ 0.00652719  0.03870884]

Same but with 20 * y_err:

Beta: [ 2.00517894  2.9489472 ]
Beta Std Error: [ 0.00642428  0.03647149]

The values are slightly different, but I do not think that this accounts for the increase in the error at all. I suspect these are just rounding differences or slightly different weighting.
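For reference, the scipy.odr numbers above were obtained roughly like this (a minimal sketch with the same assumed straight-line model; x, y and y_err as in the sketch further up):

import numpy as np
import scipy.odr

def f(B, x):
    # scipy.odr expects the parameters first: f(beta, x)
    return B[0] * x + B[1]

linear = scipy.odr.Model(f)
data = scipy.odr.RealData(x, y, sy=y_err)          # sy carries the y uncertainties
odr = scipy.odr.ODR(data, linear, beta0=[1.0, 1.0])
output = odr.run()
print('Beta:', output.beta)
print('Beta Std Error:', output.sd_beta)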

Is there some package that allows me to fit the data and get the actual errors? I have the formulas here in a book, but I do not want to implement this myself if I do not have to.


I have now read about linfit.py in another question. This handles what I have in mind quite well. It supports both modes (absolute and relative errors), and the first one is what I need.

Fit with linfit:
Parameters: [ 2.02600849  2.91759066]
Errors:     [ 0.00772283  0.04449971]

Same but with 20 * y_err:
Parameters: [ 2.02600849  2.91759066]
Errors:     [ 0.15445662  0.88999413]

Fit with linfit(relsigma=True):
Parameters: [ 2.02600849  2.91759066]
Errors:     [ 0.00622595  0.03587451]

Same but with 20 * y_err:
Parameters: [ 2.02600849  2.91759066]
Errors:     [ 0.00622595  0.03587451]
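For completeness, a rough sketch of how the linfit calls above might look (the sigmay keyword name is my assumption; relsigma is the flag shown in the labels above, and x, y, y_err are as in the earlier sketches):

import numpy as np
from linfit import linfit

# relsigma=False: errors are taken as absolute and propagate into the result
fit, cvm = linfit(x, y, sigmay=y_err, relsigma=False)
print('Parameters:', fit)
print('Errors:    ', np.sqrt(np.diag(cvm)))

# relsigma=True: errors are only used as relative weights
fit_rel, cvm_rel = linfit(x, y, sigmay=y_err, relsigma=True)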

Should I answer my question or just close/delete it now?

emesday
Martin Ueding
  • Maybe statsmodels can do this; I'm not sure if it handles general curve fitting. – Fred Foo May 30 '14 at 10:06
  • Don't throw away all you have written -- answer it, and who knows, maybe somebody knows an even better way to do it. – eickenberg May 30 '14 at 12:21
  • Definitely answer your question with what you have found (and thanks for the comment on one of my previous answers discussing `scipy.odr`). – Ffisegydd May 30 '14 at 12:34
  • See the `absolute_sigma` option for `curve_fit` in scipy 0.14: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.curve_fit.html There was a long discussion about the meaning of this before this addition was made. – Josef May 31 '14 at 12:01
  • @user333700 I have 0.13 installed on my system, and I do not think that this option is there yet. That was just added in 0.14? Then I might have to wait or install it manually. – Martin Ueding May 31 '14 at 13:03

3 Answers

6

One way that works well and actually gives a better result is the bootstrap method. When data points with errors are given, one uses a parametric bootstrap and lets each x and y value describe a Gaussian distribution. One then draws a point from each of those distributions and obtains a new bootstrapped sample. Performing a simple unweighted fit gives one set of values for the parameters.

This process is repeated some 300 to a couple of thousand times. One ends up with a distribution of the fit parameters, from which one can take the mean and standard deviation to obtain value and error.
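A minimal sketch of this parametric bootstrap (assuming errors on y only, a generic model function and generic starting values; none of this is tied to a particular data set):

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # any model function works here
    return a * x + b

def bootstrap_fit(x, y, y_err, n_boot=1000):
    params = []
    for _ in range(n_boot):
        # draw a new sample, each point from its own Gaussian distribution
        y_sample = np.random.normal(y, y_err)
        popt, _ = curve_fit(model, x, y_sample, p0=[1.0, 1.0])
        params.append(popt)
    params = np.array(params)
    # mean and standard deviation of the bootstrap distribution give value and error
    return params.mean(axis=0), params.std(axis=0)

value, error = bootstrap_fit(x, y, y_err)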

Another neat thing is that one does not obtain a single fit curve as a result, but lots of them. For each interpolated x value one can again take mean and standard deviation of the many values f(x, param) and obtain an error band:

[Plot: data with the many bootstrapped fit curves and the resulting asymmetric error band]

Further steps in the analysis are then performed hundreds of times, once with each set of fit parameters. This also takes into account the correlation of the fit parameters, as one can see clearly in the plot above: although a symmetric function was fitted to the data, the error band is asymmetric. This means that interpolated values on the left have a larger uncertainty than those on the right.

Martin Ueding
4

Please note, from the documentation of curve_fit:

sigma : None or N-length sequence
If not None, this vector will be used as relative weights in the least-squares problem.

The key point here is as relative weights; therefore, yerr on line 53 and 2*yerr on line 57 of your code should give you a similar, if not the same, result.

When you increase the actual residual error, you will see the values in the covariance matrix grow larger. For example, if we change y += random to y += 5*random in the function generate_data():

Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 1.92810458,  3.97843448]))
('Errors:    ', array([ 0.09617346,  0.64127574]))

Compares to the original result:

Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 2.00760386,  2.97817514]))
('Errors:    ', array([ 0.00782591,  0.02983339]))

Also notice that the parameter estimates are now further from (2, 3), as we would expect from the increased residual error and the larger confidence intervals of the parameter estimates.
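A self-contained way to see the same effect (this generate_data is a hypothetical stand-in for the one in the linked code, and the straight-line model is an assumption):

import numpy as np
from scipy.optimize import curve_fit

def generate_data(noise_scale):
    # hypothetical stand-in for generate_data() from the question's script
    x = np.linspace(0, 10, 100)
    y = 2.0 * x + 3.0 + noise_scale * np.random.randn(x.size)
    return x, y

for noise_scale in (1.0, 5.0):                     # y += random  versus  y += 5*random
    x, y = generate_data(noise_scale)
    popt, pcov = curve_fit(lambda x, a, b: a * x + b, x, y)
    print('Parameters:', popt)
    print('Errors:    ', np.sqrt(np.diag(pcov)))   # errors grow with the residual scatter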

CT Zhu
  • How did you manage to increase the actual residual errors? Are you using scipy 0.14 or something different? – Martin Ueding May 31 '14 at 13:05
  • No, I haven't updated to 0.14.x yet. I just changed `y += random` to `y += 5*random` in the function `generate_data()`, in the code that you linked to. – CT Zhu Jun 01 '14 at 03:06
2

Short answer

For absolute values that include the uncertainty in y (and in x for the odr case):

  • In the scipy.odr case, use stddev = numpy.sqrt(numpy.diag(cov)), where cov is the covariance matrix odr gives in the output.
  • In the scipy.optimize.curve_fit case, use the absolute_sigma=True flag.

For relative values (which exclude the uncertainty):

  • In the scipy.odr case, use the sd value from the output.

  • In the scipy.optimize.curve_fit case, use the absolute_sigma=False flag.

  • Use numpy.polyfit like this:

p, cov = numpy.polyfit(x, y, 1, cov=True)
errorbars = numpy.sqrt(numpy.diag(cov))

Long answer

There is some undocumented behavior in all of these functions. My guess is that the functions mix relative and absolute values. At the end of this answer is code that either gives what you want or not, depending on how you process the output (is there a bug?). Also, curve_fit may only have gotten the absolute_sigma flag recently.

My point is in the output. It seems that odr calculates the standard deviation as if there were no uncertainties, similar to polyfit, but if the standard deviation is calculated from the covariance matrix, the uncertainties are there. curve_fit does this with the absolute_sigma=True flag. Below is the output, containing:

  1. the diagonal elements of the covariance matrix, cov(0,0) and
  2. cov(1,1),
  3. the wrong way to get the standard deviation from the outputs for the slope and
  4. the wrong way for the constant, and
  5. the right way to get the standard deviation from the outputs for the slope and
  6. the right way for the constant.

odr:        1.739631e-06 0.02302262 [ 0.00014863  0.0170987 ] [ 0.00131895  0.15173207]
curve_fit:  2.209469e-08 0.00029239 [ 0.00014864  0.01709943] [ 0.0004899   0.05635713]
polyfit:    2.232016e-08 0.00029537 [ 0.0001494   0.01718643]

Notice that odr and polyfit have exactly the same standard deviation. polyfit does not get the uncertainties as an input, so odr apparently does not use the uncertainties when calculating the standard deviation either. The covariance matrix does use them: if, in the odr case, the standard deviation is calculated from the covariance matrix, the uncertainties are there, and they change if the uncertainty is increased. Fiddling with dy in the code below will show it.

I am writing this here mostly because this is important to know when determining error limits (and the Fortran ODRPACK guide that scipy refers to has some misleading information about this: the standard deviation should be the square root of the covariance matrix, as the guide says, but it is not).

import numpy
import scipy.odr
import scipy.optimize

# synthetic straight-line data with a known, uniform y uncertainty
x = numpy.arange(200)
y = x + 0.4*numpy.random.random(x.shape)
dy = 0.4

def stddev(cov):
    # standard deviations from the diagonal of a covariance matrix
    return numpy.sqrt(numpy.diag(cov))

def f(B, x):
    # straight line in the parameters-first form that scipy.odr expects
    return B[0]*x + B[1]

# scipy.odr fit; sy carries the y uncertainties
linear = scipy.odr.Model(f)
mydata = scipy.odr.RealData(x, y, sy=dy)
myodr = scipy.odr.ODR(mydata, linear, beta0=[1.0, 1.0], sstol=1e-20, job=0)
myoutput = myodr.run()
cov = myoutput.cov_beta            # covariance matrix
sd = myoutput.sd_beta              # standard deviations as reported by odr
p = myoutput.beta
print('odr:        ', cov[0, 0], cov[1, 1], sd, stddev(cov))

# curve_fit needs sigma as a per-point array; fit once relative, once absolute
sigma = dy*numpy.ones_like(y)
p2, cov2 = scipy.optimize.curve_fit(lambda x, a, b: a*x + b,
                                    x, y, [1, 1],
                                    sigma=sigma,
                                    absolute_sigma=False,
                                    xtol=1e-20)

p3, cov3 = scipy.optimize.curve_fit(lambda x, a, b: a*x + b,
                                    x, y, [1, 1],
                                    sigma=sigma,
                                    absolute_sigma=True,
                                    xtol=1e-20)

print('curve_fit:  ', cov2[0, 0], cov2[1, 1], stddev(cov2), stddev(cov3))

# polyfit gets no uncertainties at all
p4, cov4 = numpy.polyfit(x, y, 1, cov=True)
print('polyfit:    ', cov4[0, 0], cov4[1, 1], stddev(cov4))
Juha