
import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit
from lmfit import minimize, Parameters, Parameter, report_fit
import xlwings as xw
import os
import pandas as pd

The curve is:
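(A sketch reconstructed from the comments and answers below; the exact min/max formulation and variable names were not preserved in the post.)

x_data = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_data = np.array((250e3, 1e6, 2.5e6, 5e6, 7.5e6, 10e6))

def func1(x, n1, n2):
    # inverse lognormal, capped below at 250e3 and above at 10e6
    return np.minimum(np.maximum(np.exp(n2 + n1*sp.norm.ppf(x)), 250e3), 10e6)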

I tried running a simple curve fit from scipy:
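(Presumably a bare call like this, which uses scipy's default starting value of 1.0 for each parameter.)

curve_fit(func1, x_data, y_data)

This returns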

Out[156]: 
(array([ 1.,  1.]), array([[ inf,  inf],
        [ inf,  inf]]))

OK, so I thought I needed to start bounding my parameters, which is what Solver does. I used lmfit to do this:

params = Parameters()

params.add('n1', value=1, min=0.01, max=20)  # illustrative bounds
params.add('n2', value=1, min=0, max=20)

I tried to nudge Python in the right direction by changing the starting parameters:
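(Illustrative values only, e.g. starting n2 near the scale of the data; the exact values tried are not preserved.)

params.add('n1', value=2, min=0.01, max=20)
params.add('n2', value=10, min=0, max=20)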

or using curve_fit:
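(Again illustrative; curve_fit takes the starting guess via p0.)

popt, pcov = curve_fit(func1, x_data, y_data, p0=(2, 10))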

user33484
  • Why do you use the `minimum/maximum` in `func1`? Something which might also be worth trying is to take the log of `y_data` (check [this question](https://stackoverflow.com/questions/35903506/scipy-curve-fit-fails-power-law/35905257#35905257) for an example). – Cleb Jan 12 '18 at 08:52
  • That's the function I'd like to fit to - I am just trying to replicate excel. – user33484 Jan 12 '18 at 11:15
  • I understand that this is the function you are trying to fit, just curious whether this formulation could lead to issues regarding the optimization. Have you tried it without all this `min/max` stuff? Did you use exactly the same formulation in excel or just the `np.exp(n2 + n1*sp.norm.ppf(x))` part? – Cleb Jan 12 '18 at 11:29
  • @Cleb Hi Cleb - you're right, it is the min/max which throws it off. The fit message I receive states that "one or more variables did not affect the fit". I believe the algorithm changes the two variables, but the min/max forces the output to be either 250e3 or 10e6, and the algorithm gives up in a way, because the change in the parameter did not lead to a change in the function. This is what I think is happening. I used the exact same formulation in excel; I had the min/max of that `np.exp(n2 + n1*sp.norm.ppf(x))` part just as above - and minimised the least squared error. – user33484 Jan 12 '18 at 11:35
  • I do need a way around it however. I'd want it so that it won't just "give up" and instead keeps trying until it finds some movement. I'm not sure how I can force it to keep going. Perhaps a brute force method? – user33484 Jan 12 '18 at 11:36
  • Note that fitting without min/max leads to desirable results. – user33484 Jan 12 '18 at 11:37
  • It is still unclear to me what you are trying to achieve: i) if the removal of the `min/max` gives desirable results as you write, why do you want to keep it in? ii) What do you expect the algorithm to do when you use the `min/max` stuff? – Cleb Jan 12 '18 at 11:40
  • @Cleb A fit without min/max gives desirable results only for that function which doesn't have min/max (i.e. an inverse lognormal). I need to fit the function I have defined, not the one without min/max - the min/max is effectively a capping of the function and I'd like it to get a good fit around that capping. ii) I expect the algorithm to give me parameters which are a closer fit to the function I've defined. Running it through excel gets a better fit, so I want to know how I can get this to give a better fit than it does now (where it gives up). – user33484 Jan 12 '18 at 11:43
  • Then I would rather design a function that is automatically constrained to the desired range than introduce the `min/max` operations. How exactly this would look in this case, I don't know though ;) – Cleb Jan 12 '18 at 11:47
  • I don't understand how that would make a difference - the function would output the same figure. – user33484 Jan 12 '18 at 11:51
  • @user33484 the excel fit doesn't look better to me. It would appear to have a larger squared error than the scipy fit regardless of whether we clip the model or not. – Paul Panzer Jan 16 '18 at 14:47
  • The min/max requirements appear to be constraints imposed by Excel. Are you possibly projecting those constraints onto Python? Is not your ultimate goal to fit data points and estimate the fitting parameters? – pylang Jan 17 '18 at 03:41

3 Answers

1

I think that part of the problem is that you have only 5 observations, 2 at the same value of x, and the model does not perfectly represent your data. I also recommend fitting the log of the model to the log of the data. And, if you expect n2 to be ~10, you should use that as a starting value.

Arbitrarily applying min and max to the calculated model, especially with values so close to your data range, makes very little sense. Doing this will prevent the fit method from exploring the effect of changing the parameter values on the model function, as the quick check below illustrates.
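(A quick check with hypothetical parameter values near curve_fit's default start of (1, 1): once the capped model saturates, perturbing the parameters changes nothing, so the solver sees a flat objective and gives up.)

import numpy as np
import scipy.stats as sp

x = np.array((9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))

def capped(n1, n2):
    return np.clip(np.exp(n2 + n1*sp.norm.ppf(x)), 250e3, 10e6)

print(capped(1.0, 1.0))  # every value sits on the lower cap of 250e3
print(capped(1.1, 1.1))  # a perturbed guess gives the identical output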

With some modifications to your code to omit the first data point (which seems to be throwing off the model), I find this:

import numpy as np
import scipy.stats as sp
from lmfit import Parameters, minimize, fit_report

import matplotlib.pyplot as plt

x_data = np.array((1e-04,9e-01,9.5835e-01,9.8e-01,9.9e-01,9.9e-01))
y_data = np.array((250e3,1e6,2.5e6,5e6,7.5e6,10e6))

# drop the first observation, which seems to throw off the model
x_data = x_data[1:]
y_data = y_data[1:]

def func(params, x, data):
    n1 = params['n1'].value
    n2 = params['n2'].value

    # fit in log space: residual between log(data) and the linear model
    model = n2 + n1*sp.norm.ppf(x)
    return np.log(data) - model

params = Parameters()

params.add('n1', value= 1, min=0.01, max=20)
params.add('n2', value= 10, min=0, max=20)

# fit only the three interior points of the remaining data
result = minimize(func, params, args=(x_data[1:-1], y_data[1:-1]))

print(fit_report(result))
n1 = result.params['n1'].value
n2 = result.params['n2'].value 
ym = np.exp(n2 + n1*(sp.norm.ppf(x_data)))

plt.plot(x_data, y_data, 'o')
plt.plot(x_data, ym, '-')
plt.legend(['data', 'fit'])
plt.show()

gives a report of

[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 15
    # data points      = 3
    # variables        = 2
    chi-square         = 0.006
    reduced chi-square = 0.006
    Akaike info crit   = -14.438
    Bayesian info crit = -16.241
[[Variables]]
    n1:   1.85709072 +/- 0.190473 (10.26%) (init= 1)
    n2:   11.5455736 +/- 0.390805 (3.38%) (init= 10)
[[Correlations]] (unreported correlations are <  0.100)
    C(n1, n2)                    = -0.993

and a plot of the data together with the fitted curve.

M Newville
  • Thanks for your answer. I understand that it looks odd to have the maximum and minimum, but they are needed in this case. The function must have this maximum and minimum. The issue I do not get is, why can excel find a solution with the maximum and minimum applied but python seems to fail? I understand why python is failing as it is restricting the movement in the parameters but excel seems to be able to do it! – user33484 Jan 12 '18 at 13:27
  • Um, what makes you think that they are necessary? The function you are using most certainly does *not* have this max/min. You are imposing them. And your data values fall *exactly* on these bounds. The fit algorithm *will* have a very hard time satisfying your arbitrary bounds. Python is *not* failing. – M Newville Jan 12 '18 at 14:21
  • By necessary I mean I am trying to fit to the function with the maximum and the minimum attached - I need to fit to these bounds which was my original question. Fitting without the bounds works absolutely fine, but that's not what I am trying to fit to. My question is how does excel solver seem to find some parameters which fit this condition but python seems to "give up" and give me my default parameters. I would like to find a way to get python to keep trying until it finds suitable parameters which, by excel I know do exist. – user33484 Jan 12 '18 at 14:27
  • I agree with @user33484, this is a very very lazy attempt at an answer. Not in the spirit of the stack community. Sad! – The Problem Jan 12 '18 at 18:34
  • Uh, well the question is a classic X-Y question, asking about the attempted (and not very well reasoned) solution, and not about the problem. First, fit in log space; second, be mindful of the small data set; and third, recognize that setting tight bounds on the model (especially at data values) will *prevent* a thorough exploration of parameter space. The question also asks "why doesn't it give the answer from Excel?". I have no idea what that other approach does. Is that sad? – M Newville Jan 12 '18 at 19:22
1

The curve fitting runs smoothly when we provide a good starting point. We can get one

  • by linear regression on sp.norm.ppf(x_data) and np.log(y_data)
  • or by fitting the free (non-clipped) model first

Alternatively, if you want the computer to find the solution without "help"

  • use a stochastic algorithm like basin hopping (it's unfortunately not 100% autonomous, I had to increase the step size from its default value)
  • brute force it (this requires the user to provide a search grid)

All four methods yield essentially the same result, and it is better than the Excel result.

import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit, basinhopping, brute

def func1(x, n1, n2):
    return np.clip(np.exp(n2 + n1*(sp.norm.ppf(x))),25e4,10e6)

def func_free(x, n1, n2):
    return np.exp(n2 + n1*(sp.norm.ppf(x)))

def sqerr(n12, x, y):
    return ((func1(x, *n12) - y)**2).sum()

x_data = np.array((1e-04,9e-01,9.5835e-01,9.8e-01,9.9e-01,9.9e-01))
y_data = np.array((250e3,1e6,2.5e6,5e6,7.5e6,10e6))

# get a good starting point

# either by linear regression
lin = sp.linregress(sp.norm.ppf(x_data), np.log(y_data))

# or by using the free (non-clipped) version of the formula
(n1f, n2f), infof = curve_fit(func_free, x_data, y_data, (1, 1))

# use those on the original problem 
(n1, n2), info = curve_fit(func1, x_data, y_data, (lin.slope, lin.intercept))
(n12, n22), info2 = curve_fit(func1, x_data, y_data, (n1f, n2f))

# OR

# use basin hopping
hop = basinhopping(sqerr, (1, 1), minimizer_kwargs=dict(args=(x_data, y_data)), stepsize=10)

# OR 

# brute force it
brt = brute(sqerr, ((-100, 100), (-100, 100)), (x_data, y_data), 201, full_output=True)

# all four solutions are essentially the same:
assert np.allclose((n1, n2), (n12, n22))
assert np.allclose((n1, n2), hop.x)
assert np.allclose((n1, n2), brt[0])

# we are actually a bit better than excel
n1excel, n2excel = 1.7925, 11.6771

print('solution', n1, n2)
print('error', ((func1(x_data, n1, n2) - y_data)**2).sum())
print('excel', ((func1(x_data, n1excel, n2excel) - y_data)**2).sum())

Output:

solution 2.08286042997 11.1397332743
error 3.12796761241e+12
excel 5.80088578059e+12

Remark: One easy optimization - which I left out for simplicity and because things are fast enough anyway - would have been to pull sp.norm.ppf out of the model function, since it does not depend on the fit parameters. Whenever any of our solvers calls the function, it first performs the exact same computation - sp.norm.ppf(x_data) - so we might as well have precomputed it.

This observation is also our reason for using sp.norm.ppf(x_data) in the linear regression.
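(A minimal sketch of that precomputation; the names ppf_x and func1_pre are mine, and lin is the linregress result from above.)

ppf_x = sp.norm.ppf(x_data)  # computed once, outside the model

def func1_pre(p, n1, n2):
    # same capped model, but consumes the precomputed ppf values directly
    return np.clip(np.exp(n2 + n1*p), 25e4, 10e6)

(n1p, n2p), infop = curve_fit(func1_pre, ppf_x, y_data, (lin.slope, lin.intercept))
assert np.allclose((n1p, n2p), (n1, n2))  # same problem, same start, same answer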

Paul Panzer
  • This is perfect, thank you! Could I ask why you used sp.norm.ppf(x_data) for the regression? – user33484 Jan 16 '18 at 22:38
  • @user33484 You are welcome. The reason is that by "pulling this out" of your model function (and by taking the log on the other side) we are actually left with a linear function: if `y^ = n2 + n1 * x^` with `y^ = log(y_data)` and `x^ = normppf(x_data)`, then `y_data = exp(n2 + n1 * normppf(x_data))`. – Paul Panzer Jan 16 '18 at 22:49
0

Here is a simple way to perform regression using scipy.optimize.curve_fit:

import matplotlib.pyplot as plt
import scipy.optimize as opt
import scipy.stats as stats
import numpy as np

%matplotlib inline


# Objective: the free (uncapped) model
def model(x, n1, n2):
    return np.exp(n2 + n1*stats.norm.ppf(x))

# Data
x_samp = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_samp = np.array((250e3, 1e6, 2.5e6, 5e6 ,7.5e6, 10e6))

x_lin = np.linspace(min(x_samp), max(x_samp), 50)  # for fitting

# Regression
p0 = [5, 5]                                        # guessed params
w, cov = opt.curve_fit(model, x_samp, y_samp, p0=p0)     
print("Estimated Parameters", w)  
y_fit = model(x_lin, *w)

# Visualization
plt.plot(x_samp, y_samp, "ko", label="Data")
plt.plot(x_lin, y_fit, "k--", label="Fit")
plt.title("Curve Fitting")
plt.legend(loc="upper left")

Output

Estimated Parameters [  2.08285048  11.13975585]

[plot: "Curve Fitting" - data points with the fitted dashed line]


Details

Your data is plotted as-is, without transformations. It is easiest to perform such a regression when the model and the initial parameters, p0, reasonably fit your data. When these items are supplied to scipy.optimize.curve_fit, a tuple of the optimized parameter estimates w (the weights) is returned, along with a covariance matrix. We can compute one standard deviation of the parameters from the diagonal of the matrix:

p_stdev = np.sqrt(np.diag(cov)) 
print("Std. Dev. of Params:", p_stdev)
# Std. Dev. of Params: [ 0.42281111  0.95945127]

We visually assess the goodness of fit by supplying these estimated parameters once more to the model and plotting the fitted line.

pylang