
My question involves statistics and Python, and I am a beginner in both. I am running a simulation, and for each value of the independent variable (X) I produce 1000 values of the dependent variable (Y). What I have done is calculate the average of Y for each value of X and fit these averages using scipy.optimize.curve_fit. The curve fits nicely, but I also want to draw the confidence intervals. I am not sure whether what I am doing is correct, or whether what I want to do can be done at all, but my question is: how can I get the confidence intervals from the covariance matrix produced by curve_fit? The code first reads the averages from files and then simply uses curve_fit.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit


def readTDvsTx(L, B, P, fileformat):
    # L should be '_Fixed_' or '_'
    TD = []
    infile = open(fileformat.format(L, B, P), 'r')
    infile.readline()  # To remove header
    for line in infile:
        l = line.split()  # each line contains TxR followed by CD followed by TD
        if float(l[0]) >= 70 and float(l[0]) <= 190:
            td = float(l[2])
            TD.append(td)
    infile.close()
    tdArray = np.array(TD)

    return tdArray


def rec(x, a, b):
    return a * (1 / (x**2)) + b



fileformat = 'Densities_file{}BS{}_PRNTS{}.txt'
txR = np.array(range(70, 200, 20))
parents = np.array(range(1,6))
disc_p1 = readTDvsTx('_Fixed_', 5, 1, fileformat)


popt, pcov = curve_fit(rec, txR, disc_p1)


plt.plot(txR, rec(txR, popt[0], popt[1]), 'r-')
plt.plot(txR, disc_p1, '.')

print(popt)
plt.show()

And here is the resulting fit:

[figure: the fitted curve plotted over the averaged data points]

osmak
  • The kmpfit module can calculate the confidence band when fitting a non-linear function, see this [answer](http://stackoverflow.com/a/37080916/1628638) of mine. You will need to use all points for the fitting, not just the averages. – Ulrich Stern Sep 20 '16 at 20:43
  • PS: if you want to do the confidence band calculation yourself, my comment on the answer has a link (to [this page](http://www.graphpad.com/guides/prism/7/curve-fitting/index.htm?reg_how_confidence_and_prediction_.htm)). – Ulrich Stern Sep 20 '16 at 20:54
  • It isn't that trivial to use all the points for the fitting because osmak's function is multivariate. – Vlas Sokolov Sep 27 '16 at 16:00
  • Thank you all for your comments. The thing is I think I have misinterpreted the way I get my values. In my simulation, I search for a certain density which I call Target Density or TD for short. The way I do it is that I run 1000 simulation instances and check the average for those using some criterion, which if satisfied, indicates that I have reached my TD. Increasing the value of the independent variable will not affect the TD, i.e., it is not normally distributed. – osmak Sep 28 '16 at 15:48

1 Answer


Here's a quick and wrong answer: you can approximate the errors on your a and b parameters as the square roots of the diagonal of the covariance matrix: np.sqrt(np.diagonal(pcov)). These parameter uncertainties can then be used to draw the confidence intervals.
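Applied to your code above, a minimal sketch of this (reusing your rec, txR, popt and pcov) would look roughly like:

perr = np.sqrt(np.diagonal(pcov))       # 1-sigma errors on a and b
hires_x = np.linspace(70, 190, 100)
plt.plot(hires_x, rec(hires_x, *popt), 'r-')
# naive band: the model evaluated at popt +/- the 1-sigma parameter errors
plt.fill_between(hires_x,
                 rec(hires_x, *(popt - perr)),
                 rec(hires_x, *(popt + perr)),
                 color='gray', alpha=0.3)

Note that this band only uses the diagonal of pcov, so it ignores the covariance between a and b.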

The answer is wrong because, before you fit your data to a model, you need an estimate of the errors on your averaged disc_p1 points. By averaging, you have lost the information about the scatter of the population, leading curve_fit to believe that the y-points you feed it are absolute and indisputable. This might cause an underestimation of your parameter errors.

For an estimate of the uncertainties of your averaged Y values, you need to estimate their dispersion (for example, the standard deviation of each set of 1000 samples) and pass it to curve_fit through its sigma argument, telling it that these errors are absolute (absolute_sigma=True). Below is an example of how to do this for a random dataset where each of your points consists of 1000 samples drawn from a normal distribution.

from scipy.optimize import curve_fit
import matplotlib.pylab as plt
import numpy as np

# model function
func = lambda x, a, b: a * (1 / (x**2)) + b 

# approximating OP points
n_ypoints = 7 
x_data = np.linspace(70, 190, n_ypoints)

# approximating the original scatter in Y-data
n_nested_points = 1000
point_errors = 50
y_data = [func(x, 4e6, -100) + np.random.normal(x, point_errors,
          n_nested_points) for x in x_data]

# averages and dispersion of data
y_means = np.array(y_data).mean(axis = 1)
y_spread = np.array(y_data).std(axis = 1)

best_fit_ab, covar = curve_fit(func, x_data, y_means,
                               sigma = y_spread,
                               absolute_sigma = True)
sigma_ab = np.sqrt(np.diagonal(covar))

from uncertainties import ufloat
a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)

# plotting the unaveraged data
flier_kwargs = dict(marker = 'o', markerfacecolor = 'silver',
                    markersize = 3, alpha=0.7)
line_kwargs = dict(color = 'k', linewidth = 1)
bp = plt.boxplot(y_data, positions = x_data,
                 capprops = line_kwargs,
                 boxprops = line_kwargs,
                 whiskerprops = line_kwargs,
                 medianprops = line_kwargs,
                 flierprops = flier_kwargs,
                 widths = 5,
                 manage_ticks = False)
# plotting the averaged data with calculated dispersion
#plt.scatter(x_data, y_means, facecolor = 'silver', alpha = 1)
#plt.errorbar(x_data, y_means, y_spread, fmt = 'none', ecolor = 'black')

# plotting the model
hires_x = np.linspace(50, 190, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
                 color = 'black', alpha = 0.15)
plt.text(140, 800, text_res)
plt.xlim(40, 200)
plt.ylim(0, 1000)
plt.show()

[figure: absolutely weighted least-squares fit with confidence band]

Edit: If you are not considering the intrinsic errors on the data points, you are probably fine with the "quick and wrong" case I mentioned before. The square roots of the diagonal entries of the covariance matrix can then be used to calculate your confidence intervals. However, note that the confidence intervals have shrunk now that we've dropped the uncertainties:

from scipy.optimize import curve_fit
import matplotlib.pylab as plt
import numpy as np

func = lambda x, a, b: a * (1 / (x**2)) + b

n_ypoints = 7
x_data = np.linspace(70, 190, n_ypoints)

y_data = np.array([786.31, 487.27, 341.78, 265.49,
                    224.76, 208.04, 200.22])
best_fit_ab, covar = curve_fit(func, x_data, y_data)
sigma_ab = np.sqrt(np.diagonal(covar))

# an easy way to properly format parameter errors
from uncertainties import ufloat
a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)

plt.scatter(x_data, y_data, facecolor = 'silver',
            edgecolor = 'k', s = 10, alpha = 1)

# plotting the model
hires_x = np.linspace(50, 200, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
                 color = 'black', alpha = 0.15)
plt.text(140, 630, text_res)
plt.xlim(60, 200)
plt.ylim(0, 800)
plt.show()

[figure: fit without sigma weighting (no-sigma case), with narrower confidence band]

If you're unsure whether to include the absolute errors or how to estimate them in your case, you'd be better off asking for advice at Cross Validated, as Stack Overflow is mainly for discussion on implementations of regression methods and not for discussion on the underlying statistics.

Vlas Sokolov
  • Thanks for your answer. The thing is I think I have misinterpreted the way I get my values. In my simulation, I search for a certain density which I call Target Density or TD for short. The way I do it is that I run 1000 simulation instances and check the average for those using some criterion, which if satisfied, indicates that I have reached my TD. Increasing the value of the independent variable will not affect the TD, i.e., it is not normally distributed. – osmak Sep 28 '16 at 15:45
  • So the converged TD values come without any uncertainty then? – Vlas Sokolov Sep 28 '16 at 16:10
  • It's not that they come without any uncertainty; they are more like limits. I search for the lowest TD (the value of the independent variable) that satisfies a certain criterion, i.e., increasing it will also satisfy the criterion. If I repeat the search for a certain configuration (which may take a couple of days of execution) I usually get the same limit plus/minus 10, but this is not feasible to do because it is very time consuming which makes it very difficult to get statistically sound data. – osmak Oct 01 '16 at 09:06
  • @osmak I see. I've edited my answer to address your comment. If I understand your case, then you'd probably still need to keep in mind that the confidence bounds derived in this fashion are still an approximation for some "best case scenario" of your experiment. – Vlas Sokolov Oct 01 '16 at 11:14
  • Thanks a lot I really appreciate your efforts, you have been very helpful. – osmak Oct 01 '16 at 18:09
  • Nice example, thanks for posting. Note that I got an error running your first code sample: `TypeError: boxplot() got an unexpected keyword argument 'manage_xticks'` This error went away when I removed the `manage_xticks` argument from the call to `boxplot`. – Steve Mar 11 '21 at 13:54
  • @Steve there was an API change in matplotlib, in newer versions `manage_xticks` became `manage_ticks`. – Vlas Sokolov Mar 12 '21 at 14:26
  • “The square root of the diagonal entries of covariance matrix can then be used to calculate your confidence intervals” – doesn’t that throw away the covariance between the parameters? How would I take them into account? – Joachim Breitner Mar 19 '23 at 16:00
  • 1
    @JoachimBreitner yeah that's basically an implicit assumption there as I've used `np.diagonal`. If your parameter space is non-trivial you shouldn't be using point estimation methods such as this one and should instead use something that samples an entire parameter space. Bayesian methods typically rock there. But again, I'm not a statistician so I probably can't tell for sure :) – Vlas Sokolov Mar 20 '23 at 17:27
  • Is it normal that this fails when the error on one of the data points is 0? For instance, if I do `y_spread[3]=0` and re-run, I get `RuntimeWarning: divide by zero encountered in divide`. What is the approrpriate thing to do in this case? – sodiumnitrate Mar 30 '23 at 14:41
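One way to take the covariance between the parameters into account, as a rough sketch rather than a full treatment, is to draw many (a, b) pairs from the full covariance matrix and take percentiles of the resulting curves. This reuses func, best_fit_ab and covar from the code above, and the 16th/84th percentiles are just one choice for an approximately 1-sigma band:

import numpy as np
import matplotlib.pylab as plt

# sample parameter pairs from the full covariance matrix instead of
# using only its diagonal, then build an empirical band over the curves
n_draws = 2000
param_samples = np.random.multivariate_normal(best_fit_ab, covar, size=n_draws)
hires_x = np.linspace(70, 190, 100)
curves = np.array([func(hires_x, a_i, b_i) for a_i, b_i in param_samples])
band_lower = np.percentile(curves, 16, axis=0)
band_upper = np.percentile(curves, 84, axis=0)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
plt.fill_between(hires_x, band_lower, band_upper, color='black', alpha=0.15)
plt.show()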