
I'm currently trying to compute confidence intervals on some parameters with the bootstrap method. However, I have a little problem: even with 3000 bootstrap samples, my confidence intervals vary a lot.

Here is the situation:

I have a dataset of about 300 points, defined in the traditional way y = f(x). I know the model that fits the data, so I find the parameters with curve_fit and try to establish a confidence interval for each of them. I tried to mix the methods described here:

confidence interval with leastsq fit in scipy python

and here:

http://www.variousconsequences.com/2010/02/visualizing-confidence-intervals.html

Here is the code I use:

import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    """
    Fitting model, following a Burst kinetics.
    t is the time
    Vs is the steady velocity
    Vi is the initial velocity
    k is the Burst rate constant
    """

    y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)

    return y



[some code]

# Bootstrap index sampler; random_integers draws with both bounds inclusive
# (deprecated in newer NumPy, where np.random.randint covers the same use)
bootindex = np.random.random_integers
nboot = 3000


local_t = np.array(local_t)
local_fluo = np.array(local_fluo)
concentration = np.array(concentration)

# Convert time values from seconds to hours
local_scaled_t = local_t / 3600.0

# Product concentration relative to the starting value
conc_produit = concentration[0] - concentration

popt, pcov = curve_fit(model, local_scaled_t, conc_produit, maxfev=3000)
# Rescale fitted parameters from per-hour back to per-second units
popt = [ p / 3600 for p in popt ]

# Model prediction of the original fit, evaluated in seconds
ymod = model(local_t, popt[0], popt[1], popt[2])

r = conc_produit - ymod  # residuals of the original fit

list_para = list()

# Loop over nboot bootstrap resamples of the residuals
for i in range(nboot):

    # Refit on a synthetic dataset: original prediction plus resampled residuals
    pc, pout = curve_fit(model, local_scaled_t, ymod + r[bootindex(0, len(r) - 1, len(r))], maxfev=3000)
    # Rescale from per-hour back to per-second units
    pc = [ p / 3600 for p in pc ]

    list_para.append(pc)

    # NB: ymod is recomputed from this bootstrap fit, so the next iteration
    # resamples residuals around the latest fit rather than the original one
    ymod = model(local_t, pc[0], pc[1], pc[2])

list_para = np.array(list_para)

mean_params = np.mean(list_para, axis=0)
std_params = np.std(list_para, axis=0)

print(popt)
for true_para, para, std in zip(popt, mean_params, std_params):
    print("{0} between {1} and {2}".format(round(true_para, 6), round(para - std * 1.95996, 6), round(para + std * 1.95996, 6)))
    print("{0} between {1} and {2}".format(round(true_para, 6), round(para - std * 1.95996, 6), round(para + std * 1.95996, 6)))

Nothing complicated here; just note that I rescale the time to normalize my data and get a better fit.

And finally, here are two outputs from two runs of the same code:

[1.9023455671995163e-05, 0.01275941716148471, 0.026540319119773129]
1.9e-05 between 1.6e-05 and 2.1e-05
0.012759 between -0.042697 and 0.092152
0.02654 between -0.073456 and 0.159983

[1.9023455671995163e-05, 0.01275941716148471, 0.026540319119773129]
1.9e-05 between 1.5e-05 and 2.9e-05
0.012759 between -0.116499 and 0.17112
0.02654 between -0.186011 and 0.27797

As you can see, the differences are pretty huge. Is that expected, or am I doing something wrong? For example, I don't really understand why I have to multiply the standard deviation by 1.95996.

  • Just curious, why bootstrap the errors and not use the covariance matrix? – reptilicus Sep 13 '13 at 18:20
  • Because I don't really know what I'm doing? Haha. No, seriously, I don't really understand. Isn't the covariance supposed to give an indication of how independent the variables are? How can you get a confidence interval from that? – JPFrancoia Sep 14 '13 at 10:26
  • Strictly speaking, the covariance indicates how the parameter ESTIMATES are correlated. In short, if you just need the CI for each parameter, you just need the diagonal elements of the `vcov` matrix. Unless you want a multidimensional confidence region: http://en.wikipedia.org/wiki/Confidence_region – CT Zhu Sep 14 '13 at 21:53
  • No, that's ok. But just one last thing. If you check my code, I rescale the time to have it in hours. Like that, the fit is better, since the values used in the calculations are small (order of one). Then I transform the parameters back into the correct units with a division by 3600. But I have to do the same thing with the covariance, right? The thing is, when the unit is hours, the CI is about 5% of the parameter value. When the unit is seconds, the CI is about 100% of the parameter value, even after I divided by 3600. So how can I have my CI in seconds, just like my parameters? – JPFrancoia Sep 14 '13 at 22:05

1 Answer


Your curve_fit already gave you the covariance matrix: that is the pcov it returns (pout in your bootstrap loop). The 95% confidence limits for your ith parameter are pc[i] - 1.95996*sqrt(pout[i,i]) and pc[i] + 1.95996*sqrt(pout[i,i]). 1.95996 is the x such that the cumulative distribution function of the standard normal distribution satisfies F(x) = 0.975. You can get confidence intervals at other levels by using scipy.stats.norm.ppf. See wiki: http://en.wikipedia.org/wiki/1.96
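
To make that concrete, here is a minimal sketch of the covariance-based interval; confidence_intervals is a hypothetical helper name, not a scipy function, and it just packages the formula above:

import numpy as np
from scipy.stats import norm

def confidence_intervals(popt, pcov, level=0.95):
    """Per-parameter normal-approximation CIs from a curve_fit result."""
    popt = np.asarray(popt)
    z = norm.ppf(0.5 + level / 2.0)    # ~1.95996 when level=0.95
    stderr = np.sqrt(np.diag(pcov))    # standard error of each estimate
    return popt - z * stderr, popt + z * stderr

With the popt and pcov from your fit, confidence_intervals(popt, pcov) returns the lower and upper bounds for all three parameters at once.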

Bootstrap is not going to give you the same (or, sometimes, even close) answers each time you run it. For your specific function, the few early data points have a very big influence on the fit (see Solve equation with a set of points). I am not sure bootstrap is the way to go here: if those few early points happen not to be sampled, the fit will be very different from the fit to your original data. That also explains why your bootstrap intervals are so different from each other.
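
If you do keep the bootstrap, one option worth trying (my suggestion, not something from your post) is to summarize the replicates with percentiles instead of mean plus or minus 1.96*std, which avoids the normality assumption. Assuming list_para is the (nboot, 3) array built in your loop:

import numpy as np

# 2.5th and 97.5th percentiles of each parameter across bootstrap replicates
lower, upper = np.percentile(list_para, [2.5, 97.5], axis=0)
for lo, hi in zip(lower, upper):
    print("between {0} and {1}".format(lo, hi))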

  • But pout is a 3x3 matrix; I can't do that directly. Shall I do np.diag(pcov)? – JPFrancoia Sep 14 '13 at 10:27
  • Yes, sorry, I missed that. Or `pcov[i,i]`. The off-diagonal elements are covariances. Answer corrected. – CT Zhu Sep 14 '13 at 16:08
  • Okay. But as I asked @reptilicus already: isn't the covariance supposed to give an indication of how independent the variables are? How can you get a confidence interval from that? – JPFrancoia Sep 14 '13 at 18:11
  • Also you can think of it this way: the diagonal elements are the covariance of each parameter with itself, which is the variance. Taking `sqrt` of those gives you the standard errors. – CT Zhu Sep 14 '13 at 21:56
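
On the unit question from the comments: dividing a parameter by 3600 divides its variance by 3600**2, so its standard error (and thus the CI half-width) is divided by 3600, not 3600**2. A minimal sketch, assuming pcov is the covariance matrix from the fit done in hour units:

import numpy as np

stderr_hours = np.sqrt(np.diag(pcov))    # standard errors in per-hour units
stderr_seconds = stderr_hours / 3600.0   # Var(a*X) = a**2 * Var(X), so SE scales by |a|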