The curve fitting runs smoothly when we provide a good starting point. We can get one
- by linear regression on
sp.norm.ppf(x_data)
and np.log(y_data), or
- by fitting the free (non-clipped) model first.
Alternatively, if you want the computer to find the solution without "help":
- use a stochastic algorithm such as basin hopping (it is unfortunately not 100% autonomous; I had to increase the step size from its default value), or
- brute-force it (this requires the user to provide a search grid).
All four methods yield the same result, and it is better than the Excel result.
import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit, basinhopping, brute
def func1(x, n1, n2):
    return np.clip(np.exp(n2 + n1*sp.norm.ppf(x)), 25e4, 10e6)

def func_free(x, n1, n2):
    return np.exp(n2 + n1*sp.norm.ppf(x))

def sqerr(n12, x, y):
    return ((func1(x, *n12) - y)**2).sum()
x_data = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_data = np.array((250e3, 1e6, 2.5e6, 5e6, 7.5e6, 10e6))
# get a good starting point
# either by linear regression
lin = sp.linregress(sp.norm.ppf(x_data), np.log(y_data))
# or by using the free (non-clipped) version of the formula
(n1f, n2f), infof = curve_fit(func_free, x_data, y_data, (1, 1))
# use those on the original problem
(n1, n2), info = curve_fit(func1, x_data, y_data, (lin.slope, lin.intercept))
(n12, n22), info2 = curve_fit(func1, x_data, y_data, (n1f, n2f))
# OR
# use basin hopping
hop = basinhopping(sqerr, (1, 1), minimizer_kwargs=dict(args=(x_data, y_data)), stepsize=10)
# OR
# brute force it
brt = brute(sqerr, ((-100, 100), (-100, 100)), (x_data, y_data), 201, full_output=True)
# all four solutions are essentially the same:
assert np.allclose((n1, n2), (n12, n22))
assert np.allclose((n1, n2), hop.x)
assert np.allclose((n1, n2), brt[0])
# we are actually a bit better than excel
n1excel, n2excel = 1.7925, 11.6771
print('solution', n1, n2)
print('error', ((func1(x_data, n1, n2) - y_data)**2).sum())
print('excel', ((func1(x_data, n1excel, n2excel) - y_data)**2).sum())
Output:
solution 2.08286042997 11.1397332743
error 3.12796761241e+12
excel 5.80088578059e+12
Remark: One easy optimization - which I left out for simplicity and because things are fast enough anyway - would have been to pull sp.norm.ppf
out of the model function. This is possible because it does not depend on the fit parameters: every time one of our solvers calls the model function, it performs the exact same computation,
sp.norm.ppf(x_data),
first, so we might as well have precomputed it.
This observation is also the reason for using sp.norm.ppf(x_data)
in the linear regression.
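For completeness, here is a minimal sketch of what that precomputation could look like, using the same data and the linear-regression starting point from above (func1_pre is a hypothetical name I chose; it is not part of the original code):

```python
import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit

x_data = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_data = np.array((250e3, 1e6, 2.5e6, 5e6, 7.5e6, 10e6))

# evaluate the inverse normal CDF once, outside the model function
ppf = sp.norm.ppf(x_data)

def func1_pre(p, n1, n2):
    # p is the precomputed ppf values, not the raw probabilities
    return np.clip(np.exp(n2 + n1*p), 25e4, 10e6)

# the linear-regression starting point reuses the same precomputed values
lin = sp.linregress(ppf, np.log(y_data))
(n1, n2), info = curve_fit(func1_pre, ppf, y_data, (lin.slope, lin.intercept))
print('solution', n1, n2)
```

The model function now only does an exponential and a clip per call; the expensive ppf evaluation happens exactly once.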