6

I've spent some time on a problem these past few days. I have a set of data:

y = f(t), where y is a very small concentration (on the order of 10^-7) and t is in seconds, varying from 0 to around 12000.

The measurements follow an established model:

y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)

I need to find Vs, Vi, and k. So I used curve_fit, which returns the best-fitting parameters, and I plotted the resulting curve.

And then I used a similar model:

y = (Vs * t/3600 - ((Vs - Vi) * (1 - np.exp(-k * t/3600)) / k)) * 10**7

This way, t is in hours and y is a number between 0 and about 10. The returned parameters are of course different. But when I plot each curve, here is what I get:

https://i.stack.imgur.com/AQNpI.png

The green curve is the fit from the first model, the blue one is from the "normalized" model, and the red dots are the experimental values.

The fitted curves are different. I don't think that's expected, and I don't understand why. Are the calculations more accurate when the numbers are "reasonable"?

JPFrancoia
  • It's worth noting that this question follows from http://stackoverflow.com/questions/18656232/overflow-with-numpy-exp – Hooked Sep 06 '13 at 17:59
  • Yes it does, I forgot to mention it. In fact, the initial parameters caused the overflow. With your help and your tips, I no longer supply these parameters, and I don't get any overflow. – JPFrancoia Sep 06 '13 at 18:51

3 Answers

6

The docstring for optimize.curve_fit says,

p0 : None, scalar, or M-length sequence
    Initial guess for the parameters.  If None, then the initial
    values will all be 1 (if the number of parameters for the function
    can be determined using introspection, otherwise a ValueError
    is raised).

Thus, to begin with, the initial guess for the parameters is by default 1.

Moreover, curve fitting algorithms have to sample the function for various values of the parameters. The "various values" are initially chosen with a step size on the order of 1. The algorithm will work better if your data varies somewhat smoothly with changes in the parameter values that are on the order of 1.

If the function varies wildly with parameter changes on the order of 1, then the algorithm may tend to miss the optimum parameter values.

Note that even if the algorithm uses an adaptive step size when it tweaks the parameter values, if the initial tweak is so far off the mark as to produce a big residual, and if tweaking in some other direction happens to produce a smaller residual, then the algorithm may wander off in the wrong direction and miss the local minimum. It may find some other (undesired) local minimum, or simply fail to converge. So using an algorithm with an adaptive step size won't necessarily save you.

The moral of the story is that scaling your data can improve the algorithm's chances of finding the desired minimum.


Numerical algorithms in general tend to work better when applied to data whose magnitude is on the order of 1. This bias enters the algorithm in numerous ways. For instance, optimize.curve_fit relies on optimize.leastsq, and the call signature for optimize.leastsq is:

def leastsq(func, x0, args=(), Dfun=None, full_output=0,
            col_deriv=0, ftol=1.49012e-8, xtol=1.49012e-8,
            gtol=0.0, maxfev=0, epsfcn=None, factor=100, diag=None):

Thus, by default, the tolerances ftol and xtol are on the order of 1e-8. If finding the optimum parameter values requires much smaller tolerances, then these hard-coded defaults will cause optimize.curve_fit to miss the optimum parameter values.

To make this more concrete, suppose you were trying to minimize f(x) = 1e-100*x**2. The factor of 1e-100 squashes the y-values so much that a wide range of x-values (the parameter values mentioned above) will fit within the tolerance of 1e-8. So, with poorly scaled data, leastsq will not do a good job of finding the minimum.
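
If the defaults really are the limiting factor, they can be overridden: as far as I know, extra keyword arguments to optimize.curve_fit are forwarded to optimize.leastsq. A minimal sketch, using the model from the question, with made-up data and purely illustrative tolerance values:

import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    # Model from the question
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k

# Made-up stand-in data so the sketch runs on its own
np.random.seed(0)
t = np.linspace(0, 12000, 300)
y = model(t, 2e-11, 5e-12, 1e-3) + np.random.normal(0, 1e-10, t.size)

# ftol and xtol are passed through to leastsq, overriding the 1.49e-8 defaults
popt, pcov = curve_fit(model, t, y, p0=[1e-11, 1e-11, 1e-3],
                       ftol=1e-12, xtol=1e-12)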


Another reason to use floats on the order of 1 is that there are many more (IEEE 754) floats in the interval [-1, 1] than there are far away from 1. For example,

import struct

def floats_between(x, y):
    """
    Count the IEEE 754 doubles between x and y by reinterpreting
    their bit patterns as 64-bit integers.
    http://stackoverflow.com/a/3587987/190597 (jsbueno)
    """
    a = struct.pack("<dd", x, y)
    b = struct.unpack("<qq", a)
    return b[1] - b[0]

In [26]: floats_between(0,1) / float(floats_between(1e6,1e7))
Out[26]: 311.4397707054894

This shows there are over 300 times as many floats representing numbers between 0 and 1 as there are in the interval [1e6, 1e7]. Thus, all else being equal, you will typically get a more accurate answer when working with small numbers than with very large numbers.

unutbu
  • I think you might have a point here, and it's also possible that the initial parameters are more adapted in this case. Could you tell me more about the tolerance? I'm afraid I don't understand clearly what it is. – JPFrancoia Sep 07 '13 at 11:43
  • 1
    The `leastsq` algorithm stops when the sum of the squares of the differences (the so-called "residuals") between the fitted model and the supplied data is less than the `ftol`, or if the relative error between two iterations is less than `xtol`. If the unscaled data is very tiny then the residuals could be extremely tiny (since we are squaring) without the fit being very good. – unutbu Sep 07 '13 at 12:10
  • Got it. Is there a way to have a confidence or error range for the parameters? – JPFrancoia Sep 07 '13 at 18:30
  • 1
    This is a domain I don't understand well myself. As I understand it, there is a [formula for estimating confidence intervals](http://jkitchin.github.io/blog/2013/02/12/Nonlinear-curve-fitting-with-parameter-confidence-intervals/), but strictly speaking it only applies to *linear* regression, with the assumption that the residuals are normally distributed with mean 0 (Gaussian assumption). Under the same assumption, the formula can be used with *non-linear* regression, but then the results are asymptotic, and [other assumptions must also apply](http://bit.ly/17Kgw7f). – unutbu Sep 08 '13 at 00:50
  • There is another approach, [the bootstrap method](http://www.variousconsequences.com/2010/02/visualizing-confidence-intervals.html), which does not require so many assumptions -- only that the residuals are not a function of the independent variable. An example, using the bootstrap method, [can be found here](http://stackoverflow.com/a/5811246/190597). One drawback of the bootstrap method is that it requires numerous calls to `optimize.curve_fit`, and it is a random (as opposed to deterministic) method. – unutbu Sep 08 '13 at 00:59
  • I think the bootstrap method is a bit unusual for my domain, but it looks promising. Anyway, @unutbu, let's assume my residuals are normally distributed (it's not the case for my data, but only a small number of my measurements have a problem). For the other assumptions: these are asymptotic results: you need a lot of data -> I have plenty. So if I follow your first link, I don't even need to perform the Student's t-test. One should be sure that there is only one minimum, or that the optimizer always falls in the same minimum. I assume it's ok. – JPFrancoia Sep 08 '13 at 08:10
  • But what do they mean by: the Jacobian should be non-singular? – JPFrancoia Sep 08 '13 at 08:10
  • [The Jacobian](http://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant#Jacobian_matrix) is a matrix of partial derivatives of the function to be minimized with respect to the parameters. "Nonsingular" means the matrix is invertible. [Here is an example](http://stackoverflow.com/a/3965515/190597) showing what the Jacobian looks like. – unutbu Sep 08 '13 at 10:38
  • In your case the Jacobian is 1x3, since there are 3 parameters and the function to be minimized returns a scalar. "Nonsingular" is a term applied to square matrices. I'm not sure what the criterion would mean in your case, or if it applies at all. – unutbu Sep 08 '13 at 11:35
  • Okay, it's way beyond my comprehension. Let's keep it simple: do you think I can use this method to calculate the confidence interval? – JPFrancoia Sep 08 '13 at 12:55
  • I think it depends on your audience. Among statistical unsophisticates, it would probably pass without complaint. A mathematician or statistician may question the assumptions, or point out other ways this is lacking (asymptotic does not mean the result makes sense when n=1e6. We need some bound on the error...). The beauty of the bootstrap method is that it samples the distribution of the parameters directly. It is easier to understand so I think the result is more reliable (if one isn't willing/able to do the difficult but rigorous analysis the formula-based method would require). – unutbu Sep 08 '13 at 13:29
  • Ok, I will look into it then. Thanks. – JPFrancoia Sep 08 '13 at 14:30
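
Following up on the bootstrap method discussed in the comments above, here is a rough sketch of a residual bootstrap. The model is the one from the question, but the data, the noise level and the number of resamples are invented for illustration:

import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k

# Made-up stand-in for the measurements
np.random.seed(0)
t = np.linspace(0, 12000, 300)
y = model(t, 2e-11, 5e-12, 1e-3) + np.random.normal(0, 1e-10, t.size)

# Fit once, then resample the residuals and refit many times
popt, _ = curve_fit(model, t, y, p0=[1e-11, 1e-11, 1e-3])
residuals = y - model(t, *popt)

samples = []
for _ in range(500):
    y_boot = model(t, *popt) + np.random.choice(residuals, size=t.size, replace=True)
    p_boot, _ = curve_fit(model, t, y_boot, p0=popt)
    samples.append(p_boot)
samples = np.array(samples)

# Percentile-based 95% intervals for Vs, Vi and k
lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)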
2

I would imagine it has more to do with the initial parameter estimates you are passing to curve_fit. If you are not passing any, I believe they all default to 1. Normalizing your data makes those initial estimates closer to the truth. If you don't want to use normalized data, just pass the initial estimates yourself and give them reasonable values, as in the sketch below.
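
A minimal sketch of that suggestion, using the model from the question; the data and the hand-picked starting values below are invented just so the example runs on its own:

import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    # Model from the question
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k

# Made-up stand-in for the measured data
np.random.seed(0)
t = np.linspace(0, 12000, 300)
y = model(t, 2e-11, 5e-12, 1e-3) + np.random.normal(0, 1e-10, t.size)

# Rough, hand-picked starting values instead of the default p0 = [1, 1, 1]
p0 = [1e-11, 1e-11, 1e-3]
popt, pcov = curve_fit(model, t, y, p0=p0)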

Hammer
2

Others have already mentioned that you probably need a good starting guess for your fit. In cases like this, I usually try to find some quick and dirty tricks to get at least a ballpark estimate of the parameters. In your case, for large t the exponential decays pretty quickly to zero, so for large t you have

y == Vs * t - (Vs - Vi)  / k

Doing a first-order linear fit like

slope1, offset1 = np.polyfit(t[t > 2000], y[t > 2000], 1)

you will get slope1 == Vs and offset1 == (Vi - Vs) / k.

Subtracting this straight line from all the points you have, you get the exponential

residual == y - slope1 * t - offset1 == (Vs - Vi) * exp(-t * k) / k

Taking the log of both sides, you get

log(residual) == log((Vs - Vi) / k) - t * k

So doing a second fit

slope2, offset2 = np.polyfit(t, np.log(y - slope1 * t - offset1), 1)

will give you slope2 == -k and offset2 == log((Vs - Vi) / k), which you can solve for Vi since you already know Vs and k. You might have to limit the second fit to small values of t, otherwise you might end up taking the log of negative numbers. Collect all the parameters you obtained from these fits and use them as the starting points for your curve_fit.
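
Putting the whole recipe together, here is a rough, self-contained sketch in numpy terms. The data, the noise level and the t > 2000 / t < 2000 cutoffs are all made up for illustration; the right cutoff for real data depends on where the exponential has effectively died out.

import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k

# Made-up stand-in for the measurements (true parameter values invented)
np.random.seed(0)
t = np.linspace(0, 12000, 300)
y = model(t, 2e-11, 5e-12, 1e-3) + np.random.normal(0, 1e-10, t.size)

# Step 1: linear fit of the tail, where the exponential has decayed
tail = t > 2000
slope1, offset1 = np.polyfit(t[tail], y[tail], 1)
Vs0 = slope1                          # slope1 == Vs

# Step 2: fit the log of the residual at small t to get k and Vi
early = t < 2000
residual = y[early] - (slope1 * t[early] + offset1)
ok = residual > 0                     # avoid log of non-positive values
slope2, offset2 = np.polyfit(t[early][ok], np.log(residual[ok]), 1)
k0 = -slope2
Vi0 = Vs0 - k0 * np.exp(offset2)      # offset2 == log((Vs - Vi) / k)

# Step 3: use the ballpark values as the starting point for curve_fit
popt, pcov = curve_fit(model, t, y, p0=[Vs0, Vi0, k0])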

Finally, you might want to look into doing some sort of weighted fit. The information about the exponential part of your curve is contained in just the first few points, so maybe you should give those a higher weight. Doing this in a statistically correct way is not trivial.
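
For what it's worth, curve_fit accepts a sigma argument that acts as per-point uncertainties (smaller sigma means larger weight), so a crude version of such a weighting could look like the sketch below. The weights and data here are invented and are not a statistically justified choice:

import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k

# Made-up stand-in data
np.random.seed(0)
t = np.linspace(0, 12000, 300)
y = model(t, 2e-11, 5e-12, 1e-3) + np.random.normal(0, 1e-10, t.size)

# Ad-hoc weighting: pretend the early points (which carry the exponential
# information) are ten times more precise than the rest.
sigma = np.where(t < 2000, 1e-10, 1e-9)
popt, pcov = curve_fit(model, t, y, p0=[1e-11, 1e-11, 1e-3], sigma=sigma)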

Bas Swinckels