24

I'm trying to fit the distribution of some experimental values with a custom probability density function. Obviously, the integral of the resulting function should always be equal to 1, but the results of simple scipy.optimize.curve_fit(function, dataBincenters, dataCounts) never satisfy this condition. What is the best way to solve this problem?

Saullo G. P. Castro
  • 56,802
  • 26
  • 179
  • 234
Axon
  • 547
  • 2
  • 6
  • 14

5 Answers5

22

You can define your own residuals function, including a penalization parameter, like detailed in the code below, where it is known beforehand that the integral along the interval must be 2.. If you test without the penalization you will see that what your are getting is the conventional curve_fit:

enter image description here

import matplotlib.pyplot as plt
import scipy
from scipy.optimize import curve_fit, minimize, leastsq
from scipy.integrate import quad
from scipy import pi, sin

x = scipy.linspace(0, pi, 100)
y = scipy.sin(x) + (0. + scipy.rand(len(x))*0.4)
def func1(x, a0, a1, a2, a3):
    return a0 + a1*x + a2*x**2 + a3*x**3

# here you include the penalization factor
def residuals(p, x, y):
    integral = quad(func1, 0, pi, args=(p[0], p[1], p[2], p[3]))[0]
    penalization = abs(2.-integral)*10000
    return y - func1(x, p[0], p[1], p[2], p[3]) - penalization

popt1, pcov1 = curve_fit(func1, x, y)
popt2, pcov2 = leastsq(func=residuals, x0=(1., 1., 1., 1.), args=(x, y))
y_fit1 = func1(x, *popt1)
y_fit2 = func1(x, *popt2)
plt.scatter(x, y, marker='.')
plt.plot(x, y_fit1, color='g', label='curve_fit')
plt.plot(x, y_fit2, color='y', label='constrained')
plt.legend()
plt.xlim(-0.1, 3.5)
plt.ylim(0, 1.4)
print('Exact integral:', quad(sin, 0, pi)[0])
print('Approx integral1:', quad(func1, 0, pi, args=(popt1[0], popt1[1], popt1[2], popt1[3]))[0])
print('Approx integral2:', quad(func1, 0, pi, args=(popt2[0], popt2[1], popt2[2], popt2[3]))[0])
plt.show()

#Exact   integral: 2.0
#Approx integral1: 2.60068579748
#Approx integral2: 2.00001911981

Other related questions:

Saullo G. P. Castro
  • 56,802
  • 26
  • 179
  • 234
  • @Axon Which warning? It would be nice if you could paste your code somewhere on the web... – Saullo G. P. Castro May 20 '13 at 09:31
  • I tried this method, but in this case I get this warning: http://dpaste.org/NfLwy/, and the resulting fitted curve doesn't even nearly resemble the distribution. [scipy.optimize.curve_fit](http://storage8.static.itmages.ru/i/13/0520/h_1369043990_1772962_4ff4c2e582.png) [with penalization](http://storage1.static.itmages.ru/i/13/0520/h_1369044034_5072385_fce21d00cc.png) – Axon May 20 '13 at 10:01
  • @Axon This is an integration error. I'am checking here, but you can try another penalization factor `10000` and see what happens. You can also change the initial guess=(1.,1.,1.,1.) to another attempt – Saullo G. P. Castro May 20 '13 at 10:10
  • @Axon [This answer](http://stackoverflow.com/a/16651524/832621) or [this answer](http://stackoverflow.com/a/16651955/832621) may give you some insight about how to fit a distribution function – Saullo G. P. Castro May 20 '13 at 14:43
  • Is there any reason why not to use `popt1, pcov1 = curve_fit( func1withpenalization, x, y )`, where the definition of `func1withpenalization` includes the penalization, e.g,: `def func1withpenalization(p,x,y): integral = quad( func1, 0, pi, args=(p[0],p[1],p[2],p[3]))[0]; penalization = abs(2.-integral)*10000; return y - func1(x, p[0],p[1],p[2],p[3]) - penalization` – altroware Jan 12 '15 at 10:53
  • 1
    @altroware no special reason, but since `curve_fit` is a Python wrapper around `leastsq` I preferred to use the latter... but It would nice to have a new answer with `curve_fit` ;) – Saullo G. P. Castro Jan 12 '15 at 11:13
11

Here is an almost-identical snippet which makes only use of curve_fit.

import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt
import scipy.integrate as integr


x = np.linspace(0, np.pi, 100)
y = np.sin(x) + (0. + np.random.rand(len(x))*0.4)

def Func(x, a0, a1, a2, a3):
    return a0 + a1*x + a2*x**2 + a3*x**3

# modified function definition with Penalization
def FuncPen(x, a0, a1, a2, a3):
    integral = integr.quad( Func, 0, np.pi, args=(a0,a1,a2,a3))[0]
    penalization = abs(2.-integral)*10000
    return a0 + a1*x + a2*x**2 + a3*x**3 + penalization


popt1, pcov1 = opt.curve_fit( Func, x, y )
popt2, pcov2 = opt.curve_fit( FuncPen, x, y )

y_fit1 = Func(x, *popt1)
y_fit2 = Func(x, *popt2)

plt.scatter(x,y, marker='.')
plt.plot(x,y_fit2, color='y', label='constrained')
plt.plot(x,y_fit1, color='g', label='curve_fit')
plt.legend(); plt.xlim(-0.1,3.5); plt.ylim(0,1.4)
print 'Exact   integral:',integr.quad(np.sin ,0,np.pi)[0]
print 'Approx integral1:',integr.quad(Func,0,np.pi,args=(popt1[0],popt1[1],
                                                popt1[2],popt1[3]))[0]
print 'Approx integral2:',integr.quad(Func,0,np.pi,args=(popt2[0],popt2[1],
                                                popt2[2],popt2[3]))[0]
plt.show()

#Exact   integral: 2.0
#Approx integral1: 2.66485028754
#Approx integral2: 2.00002116217

enter image description here

wandadars
  • 1,113
  • 4
  • 19
  • 37
altroware
  • 940
  • 3
  • 13
  • 22
3

Following the example above here is more general way to add any constraints:

from scipy.optimize import minimize
from scipy.integrate import quad
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, np.pi, 100)
y = np.sin(x) + (0. + np.random.rand(len(x))*0.4)

def func_to_fit(x, params):
    return params[0] + params[1] * x + params[2] * x ** 2 + params[3] * x ** 3

def constr_fun(params):
    intgrl, _ = quad(func_to_fit, 0, np.pi, args=(params,))
    return intgrl - 2

def func_to_minimise(params, x, y):
    y_pred = func_to_fit(x, params)
    return np.sum((y_pred - y) ** 2)

# Do the parameter fitting
#without constraints
res1 = minimize(func_to_minimise, x0=np.random.rand(4), args=(x, y))
params1 = res1.x
# with constraints
cons = {'type': 'eq', 'fun': constr_fun}
res2 = minimize(func_to_minimise, x0=np.random.rand(4), args=(x, y), constraints=cons)
params2 = res2.x

y_fit1 = func_to_fit(x, params1)
y_fit2 = func_to_fit(x, params2)

plt.scatter(x,y, marker='.')
plt.plot(x, y_fit2, color='y', label='constrained')
plt.plot(x, y_fit1, color='g', label='curve_fit')
plt.legend(); plt.xlim(-0.1,3.5); plt.ylim(0,1.4)
plt.show()
print(f"Constrant violation: {constr_fun(params1)}")

Constraint violation: -2.9179325622408214e-10

1

If you are able normalise your probability fitting function in advance then you can use this information to constrain your fit. A very simple example of this would be fitting a Gaussian to data. If one were to fit the following three-parameter (A, mu, sigma) Gaussian then it would be unnormalised in general:

Gaussian

however, if one instead enforces the normalisation condition on A:

Normalised

then the Gaussian is only two parameter and is automatically normalised.

Mead
  • 383
  • 4
  • 19
0

You could ensure that your fitted probability distribution is normalised via a numerical integration. For example, assuming that you have data x and y and that you have defined an unnormalised_function(x, a, b) with parameters a and b for your probability distribution, which is defined on the interval x1 to x2 (which could be infinite):

from scipy.optimize import curve_fit
from scipy.integrate import quad

# Define a numerically normalised function
def normalised_function(x, a, b):
    normalisation, _ = quad(lambda x: unnormalised_function(x, a, b), x1, x2)
    return unnormalised_function(x, a, b)/normalisation

# Do the parameter fitting
fitted_parameters, _ = curve_fit(normalised_function, x, y)
Mead
  • 383
  • 4
  • 19