
I'm trying to fit a function to a data set from an experiment using Python. The fit itself looks pretty good, but the errors reported for the parameters are incredibly high and I'm not sure how to fix this.

The function looks like this: Function

The data consist of a time data set and a y data set. The variable "ve" is a linear velocity function, which is why it is replaced with "a*x+b" in the code. The fit looks really good and theoretically the function should describe the data, but the parameter errors are crazily high. The code is the following:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lmfit import Model
from numpy import log
from scipy.optimize import curve_fit

data6 = pd.read_csv('2594.csv')
x=data6.iloc[:18,0]
y=data6.iloc[:18,1]

def func(x, a, b, mo, q):
    return (4.9+a*x+b)*x+(a*x+b)*((mo/q)-x)*log(1-(q*x)/mo)-0.5*9.8*x*x

popt, pcov = curve_fit(func,x,y,bounds=((0, -100, 0, 0), (1000, 1000, 1, 1)))
plt.plot(x, func(x, *popt), 'g--',
         label='fit: a=%5.3f, b=%5.3f, mo=%5.3f, q=%5.3f' % tuple(popt))
plt.plot(x,y,':', label='Daten')
plt.grid(True)
plt.legend(loc='upper left')
plt.xlabel("t [s]")
plt.ylabel("z [m]")
plt.title('Anpassung vor Zeitpunkt T', )
plt.savefig('fit1.pdf')
plt.show()

Here is the resulting fit: Fit1

and the covariance matrix:

[[ 3.66248820e+09  2.88800781e+09 -5.59803683e+06 -4.01121935e+05]
 [ 2.88800781e+09  2.27731332e+09 -4.44058731e+06 -3.17108449e+05]
 [-5.59803683e+06 -4.44058731e+06  2.43805434e+05  7.83731345e+03]
 [-4.01121935e+05 -3.17108449e+05  7.83731345e+03  2.65778118e+02]]

I also tried the following lmfit model, but I get errors of over 1400%:

fmodel = Model(func)
result = fmodel.fit(y, x=x, a=14, b=3.9, mo=0.8, q=0.002)

This is the fit report:

a:   926.607518 +/- 182751.047 (19722.59%) (init = 14)
b:   737.755741 +/- 143994.520 (19517.91%) (init = 3.9)
mo:  0.27745681 +/- 27.5360933 (9924.46%) (init = 0.8)
q:   0.00447098 +/- 0.60437392 (13517.72%) (init = 0.002)

And this is the resulting fit: Fit2. I would really appreciate some help, if possible a simple guide on how to reduce the parameter errors!

The data looks like this:

x=[0.0333 0.0667 0.1    0.133  0.167  0.2    0.233  0.267  0.3    0.333  
   0.367  0.4    0.433  0.467  0.5    0.533  0.567  0.6   ]
y=[0.104 0.249 0.422 0.6   0.791 1.    1.23  1.47  1.74  2.02  2.33  2.64
   2.99  3.34  3.71  4.08  4.47  4.85 ]

Thank you!

Manuel Claros
  • `pcov` is the covariance matrix. The errors of each parameter are the values on the diagonal. – Ardweaden Jul 25 '19 at 09:49
  • Thank you ardweaden! Did not know that, but I checked and the errors are still crazily high – Manuel Claros Jul 25 '19 at 09:56
  • To clarify, the values are errors _squared_. Anyhow, how does your data look like? Also, you could reduce the number of parameters by making `mo/q` a single parameter, which might help. – Ardweaden Jul 25 '19 at 10:04
  • Still the errors are of the order of e+09 also still way to big. setting mo/q makes the error a bit better but its still over 1000%. I added the data to the post – Manuel Claros Jul 25 '19 at 10:15
  • Looking at your covariance matrix, values are large everywhere. That shouldn't happen, ideally all non-diagonal elements would be zero. This implies your parameters are strongly correlated, which usually means some are obsolete. I assume this is down mainly to the complexity of your model (if it was simpler, the parameter errors would be smaller, because a much narrower set of values would fit closely. For example, using a simple linear function would leave a much larger residual, but the best parameters would be quite tightly determined). Ultimately, I can't help you. – Ardweaden Jul 25 '19 at 10:53
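
The diagonal-of-`pcov` conversion mentioned in the comments can be sketched like this. The numbers here are purely illustrative, not the question's actual covariance matrix; `pcov` and `perr` just follow the naming in the question's code:

```python
import numpy as np

# Illustrative 2x2 covariance matrix of the kind curve_fit returns
# as its second return value (values made up for this sketch).
pcov = np.array([[4.0, 1.0],
                 [1.0, 9.0]])

# One-sigma uncertainty of each parameter: square root of the diagonal.
perr = np.sqrt(np.diag(pcov))
print(perr)  # [2. 3.]

# Correlation matrix: divide each element by the two standard deviations.
# Off-diagonal entries near +/-1 indicate strongly correlated parameters.
corr = pcov / np.outer(perr, perr)
print(corr)
```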

2 Answers


If you had printed out the full fit report from lmfit (or properly untangled the components of the covariance matrix from curve_fit), you would see that the parameters a and b are 100% correlated.

Basically, this is the fitting algorithm telling you that your data is not described well by your model, and that you don't need that many parameters (or perhaps these particular parameters and this model) to describe your data.

Indeed, if you plot the data, there is a gentle slope up. Your function is the sum of three different terms:

   (4.9+a*x+b)*x
 + (a*x+b)*((mo/q)-x)*log(1-x/(mo/q)) 
 - 0.5*9.8*x*x

There are a couple of things to note:

  1. mo and q only appear together, and as mo/q. They will not be independent.
  2. a and b only appear together, and in the same form in multiple places
  3. there are 2 purely quadratic terms in x, one of them hardwired.
  4. the logarithmic term also has a quadratic prefactor. Importantly, the x data does not vary by more than 1 order of magnitude, so that log term won't really vary by much, giving mostly a third quadratic term. (as an aside: if you're taking logs, you should ensure that the argument is actually positive -- log(1-x*a) is asking for trouble)
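
The log-argument hazard in point 4 can be guarded by clipping the argument to a small positive floor before taking the log. A minimal sketch (the helper name and the floor value `eps` are arbitrary choices for illustration, not part of the original code):

```python
import numpy as np

def safe_log1m(x, q, mo, eps=1e-12):
    # Clip 1 - q*x/mo to a tiny positive floor so np.log never sees
    # zero or a negative value during the optimizer's trial steps.
    arg = 1.0 - (q * x) / mo
    return np.log(np.clip(arg, eps, None))

# Example: the last point would otherwise feed log a negative argument.
x = np.array([0.1, 0.5, 2.0])
print(safe_log1m(x, q=0.5, mo=0.65))
```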

To summarize: your model is too complicated for your data.

I very much doubt you need the log term at all. As it turns out, you can get a pretty good fit with a simple parabolic model:

import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import ParabolicModel

x = np.array([0.0333, 0.0667, 0.1, 0.133, 0.167, 0.2, 0.233 , 0.267, 0.3 ,
          0.333, 0.367, 0.4, 0.433, 0.467, 0.5, 0.533 , 0.567 , 0.6 ])

y = np.array([0.104, 0.249 , 0.422, 0.6, 0.791, 1.0, 1.23, 1.47, 1.74,
          2.02, 2.33, 2.64, 2.99, 3.34, 3.71, 4.08, 4.47, 4.85 ])

qmodel = ParabolicModel()
result = qmodel.fit(y, x=x, a=1, b=2, c=0)
print(result.fit_report())
fitlabel = "fit: a=%5.3f, b=%5.3f, c=%5.3f" % (result.params['a'].value,
                                               result.params['b'].value,
                                               result.params['c'].value)
plt.plot(x, y, label='Daten')
plt.plot(x, result.best_fit, label=fitlabel)
plt.xlabel("t [s]")
plt.ylabel("z [m]")
plt.legend(loc='upper left')
plt.title("Anpassung vor Zeitpunkt T (Model: a*x^2+b*x+c)")
plt.show()

which will give a report of

[[Model]]
    Model(parabolic)
[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 9
    # data points      = 18
    # variables        = 3
    chi-square         = 0.00298906
    reduced chi-square = 1.9927e-04
    Akaike info crit   = -150.657052
    Bayesian info crit = -147.985936
[[Variables]]
    c: -0.02973853 +/- 0.01120090 (37.66%) (init = 0)
    b:  3.67707491 +/- 0.08142567 (2.21%) (init = 2)
    a:  7.51540814 +/- 0.12492370 (1.66%) (init = 1)
[[Correlations]] (unreported correlations are < 0.100)
    C(b, a) = -0.972
    C(c, b) = -0.891
    C(c, a) =  0.785

and a plot of the fit.

M Newville
  • Good point to notice the `log(1-x)` trouble. Actually the `c` should be even skipped, as the presented model forces `y = 0` for `x = 0` . One sees the entire correlation trouble when Taylor expanding the log-term first order. Many terms drop out and the only remaining term having m0/q is third order. As the data is very parabolic, this is an overkill and the comment on a too complex model is definitively correct. – mikuszefski Jul 25 '19 at 12:50
  • Thank you! I kind of felt that the model was complicated from the beginning. The problem was that I had to use the model to prove that a theoretical derivation of the formula coincides with the data. Ultimately I solved the problem with two steps: 1) I could determine a value for "mo". 2) It was first assumed that the velocity grew linearly with time but actually it was constant. Thank you for taking the time for answering my question. I will post my complete code as an answer! – Manuel Claros Jul 25 '19 at 17:25

The data and the formula came from a physics experiment, so I had to use this function. I could determine the actual value of mo, and I also had to think about the physical meaning of the different variables. In the formula, "vo" and "ve" have to be equal, because the data "starts" after a period of acceleration, and because in the original fit "ve" was assumed to be a linear function, but physically it should be a constant velocity. That's why in the following code the prefactor vo + ve becomes 2*ve. That left the fit function with two parameters, which could be determined easily and with a small error.

def func(x, ve, q):
    return (2*ve*x) + ve*((0.65/q) - x)*log(1 - (q*x)/0.65) - 0.5*9.8*x*x

popt, pcov = curve_fit(func,x,y,bounds=((-100, 0), (1500, 1.5)))
plt.plot(x, func(x, *popt), color='orange',
         label='Anpassung: $v_e$=%5.3f, q=%5.3f' % tuple(popt))
plt.plot(x,y,':', label='Daten')

print(pcov)

plt.grid(True)
plt.legend(loc='upper left')
plt.xlabel("t [s]")
plt.ylabel("z [m]")
plt.title('Anpassung vor Zeitpunkt T', )
plt.savefig('anpass.pdf')
plt.savefig('anpass.png')
plt.show()

Which gives the covariance matrix:

[[ 0.01807039 -0.00305967]
 [-0.00305967  0.00065509]]
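
For reference, the one-sigma uncertainties follow from the square roots of the diagonal entries above; a quick check (numbers copied from the matrix, so this is arithmetic rather than a re-fit):

```python
import numpy as np

# Covariance matrix as printed above, for ve and q.
pcov = np.array([[ 0.01807039, -0.00305967],
                 [-0.00305967,  0.00065509]])

# Standard errors of ve and q: square roots of the diagonal.
perr = np.sqrt(np.diag(pcov))
print(perr)  # roughly [0.134, 0.026]
```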

And the Fit:

Fitted Function

These are realistic values for the parameters of the function.

Thank you for your help and I hope this post is useful for someone someday!

Manuel Claros
  • Well, as a physicist I will insist that you do not have to use any particular functional form [citation: Occam]. But I get the constraints of the assignment. If you fix some of the parameter values, you won't see the correlations with the varying parameters. That doesn't mean they're not there. Above, someone said that correlations would ideally be 0. That's not realistic (slope and intercept of a line are *always* correlated). The key is for the uncertainties to reflect those correlations. For your case the uncertainties diverge exactly as they should when the correlations go to 1. Good! – M Newville Jul 26 '19 at 01:18