Suppose I have a sample which I have reason to believe follows an exponential distribution. I want to estimate the distribution's parameter (lambda) and some indication of the confidence. Either the confidence interval or the standard error would be fine. Sadly, `scipy.stats.expon.fit` does not seem to permit that. Here's an example; I will use lambda = 1/120 ≈ 0.00833 for the test data:
""" Generate test data"""
import scipy.stats
test_data = scipy.stats.expon(scale=120).rvs(size=3000)
""" Scipy.stats fit"""
fit = scipy.stats.expon.fit(test_data)
print("\nExponential parameters:", fit, " Specifically lambda: ", 1/fit[1], "\n")
# Exponential parameters: (0.0066790678905608875, 116.8376079908356) Specifically lambda: 0.008558887991599736
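As an aside, `scipy.stats` can at least pin the location during fitting via the standard `floc` keyword; with loc fixed at 0, the fitted scale is simply the sample mean, i.e. the textbook MLE. A minimal sketch (still without any confidence information, though):
""" Scipy.stats fit with loc pinned at 0 """
fit_floc = scipy.stats.expon.fit(test_data, floc=0)
print("lambda with loc fixed at 0:", 1/fit_floc[1])  # close to 1/test_data.mean()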
The answer to a similar question about gamma-distributed data suggests using `GenericLikelihoodModel` from the `statsmodels` module. While I can confirm that this works nicely for gamma-distributed data, it does not work for exponential distributions, because the optimization apparently results in a non-invertible Hessian matrix. This results either from non-finite elements in the Hessian or from `np.linalg.eigh` producing non-positive eigenvalues for the Hessian. (Source code here; `HessianInversionWarning` is raised in the `fit` method of the `LikelihoodModel` class.)
""" Statsmodel fit"""
from statsmodels.base.model import GenericLikelihoodModel
class Expon(GenericLikelihoodModel):
    nparams = 2

    def loglike(self, params):
        return scipy.stats.expon.logpdf(self.endog, *params).sum()
res = Expon(test_data).fit(start_params=fit)
res.df_model = len(fit)
res.df_resid = len(test_data) - len(fit)
print(res.summary())
#Optimization terminated successfully.
# Current function value: 5.760785
# Iterations: 38
# Function evaluations: 76
#/usr/lib/python3.8/site-packages/statsmodels/tools/numdiff.py:352: RuntimeWarning: invalid value encountered in double_scalars
# hess[i, j] = (f(*((x + ee[i, :] + ee[j, :],) + args), **kwargs)
#/usr/lib/python3.8/site-packages/statsmodels/base/model.py:547: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
# warn('Inverting hessian failed, no bse or cov_params '
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in greater
# return (a < x) & (x < b)
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in less
# return (a < x) & (x < b)
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:1912: RuntimeWarning: invalid value encountered in less_equal
# cond2 = cond0 & (x <= _a)
# Expon Results
#==============================================================================
#Dep. Variable: y Log-Likelihood: -17282.
#Model: Expon AIC: 3.457e+04
#Method: Maximum Likelihood BIC: 3.459e+04
#Date: Thu, 06 Aug 2020
#Time: 13:55:24
#No. Observations: 3000
#Df Residuals: 2998
#Df Model: 2
#==============================================================================
# coef std err z P>|z| [0.025 0.975]
#------------------------------------------------------------------------------
#par0 0.0067 nan nan nan nan nan
#par1 116.8376 nan nan nan nan nan
#==============================================================================
This seems to happen every single time, so it might be something specific to exponentially distributed data.
Is there some other possible approach? Or am I maybe missing something or doing something wrong here?
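One variant I can sketch (under the untested assumption that the failing Hessian is caused by the extra loc parameter) would be to fix loc at 0 and estimate only lambda, along these lines:
""" Sketch: one-parameter statsmodels fit, assuming loc = 0 """
class ExponLambda(GenericLikelihoodModel):
    nparams = 1

    def loglike(self, params):
        # params[0] is the rate lambda; scipy parametrizes by scale = 1/lambda
        return scipy.stats.expon.logpdf(self.endog, scale=1/params[0]).sum()

res1 = ExponLambda(test_data).fit(start_params=[1/test_data.mean()])
res1.df_model = 1
res1.df_resid = len(test_data) - 1
print(res1.summary())  # with one regular parameter the Hessian is a scalar and should invert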
Edit: It turns out I was doing something wrong; namely, I wrongly had
test_data = scipy.stats.expon(120).rvs(size=3000)
instead of
test_data = scipy.stats.expon(scale=120).rvs(size=3000)
and was correspondingly looking at the first element of the fit tuple when I should have been looking at the second.
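(The positional argument is loc, not scale, which is easy to check on the frozen distributions:)
""" Parametrization check """
print(scipy.stats.expon(120).mean())        # loc=120, scale=1   -> 121.0
print(scipy.stats.expon(scale=120).mean())  # loc=0,   scale=120 -> 120.0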
As a result, the two other options I had considered (manually computing the fit and confidence intervals following the standard procedure described on Wikipedia, and using `scikits.bootstrap` as suggested in this answer) actually do work and are part of the solution I'll add in a minute, not part of the problem.
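For reference, a minimal sketch of the manual route, using the standard exact result that 2*lambda*sum(x) follows a chi-squared distribution with 2n degrees of freedom for i.i.d. exponential data:
""" Sketch: manual MLE and exact confidence interval for lambda """
n = len(test_data)
total = test_data.sum()
lambda_hat = n / total                                    # MLE, i.e. 1/mean
alpha = 0.05
ci = (scipy.stats.chi2.ppf(alpha/2, 2*n) / (2*total),
      scipy.stats.chi2.ppf(1 - alpha/2, 2*n) / (2*total))
print("lambda:", lambda_hat, " 95% CI:", ci)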