
The original problem

While translating MATLAB code to Python, I came across the function [parmhat,parmci] = gpfit(x,alpha). This function fits a Generalized Pareto Distribution and returns the parameter estimates, parmhat, and the 100(1-alpha)% confidence intervals for the parameter estimates, parmci.

MATLAB also provides the function gplike, which returns acov, the inverse of Fisher's information matrix. When the parameters are estimated by MLE, this matrix contains their asymptotic variances on the diagonal. I have the feeling this can be coupled to the confidence intervals as well, but my statistics background is not strong enough to tell whether that is true.
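From what I gather, the diagonal of acov relates to confidence intervals via Wald intervals: estimate ± z·sqrt(asymptotic variance), with z the standard normal quantile. A minimal sketch of that idea (the helper names neg_loglike, num_hessian and gp_wald_ci are mine, the Hessian is a crude finite-difference approximation, and MATLAB may parameterize the scale differently, so this need not reproduce gpfit's parmci exactly):

```python
import numpy as np
from scipy.stats import genpareto, norm


def neg_loglike(theta, data):
    # Negative log-likelihood of a Generalized Pareto Distribution
    # with the location fixed at 0; theta = (shape, scale).
    return -genpareto.logpdf(data, theta[0], 0, theta[1]).sum()


def num_hessian(f, theta, data, eps=1e-4):
    # Central finite-difference approximation of the Hessian of f at theta.
    n = len(theta)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            tpp = np.array(theta); tpp[i] += eps; tpp[j] += eps
            tpm = np.array(theta); tpm[i] += eps; tpm[j] -= eps
            tmp = np.array(theta); tmp[i] -= eps; tmp[j] += eps
            tmm = np.array(theta); tmm[i] -= eps; tmm[j] -= eps
            H[i, j] = (f(tpp, data) - f(tpm, data)
                       - f(tmp, data) + f(tmm, data)) / (4 * eps ** 2)
    return H


def gp_wald_ci(data, alpha=0.05):
    # Fit with loc fixed at 0, then invert the observed Fisher information
    # (Hessian of the negative log-likelihood at the MLE) to get the
    # asymptotic covariance matrix; its diagonal holds the variances.
    shape, _, scale = genpareto.fit(data, floc=0)
    theta = np.array([shape, scale])
    acov = np.linalg.inv(num_hessian(neg_loglike, theta, data))
    se = np.sqrt(np.diag(acov))
    z = norm.ppf(1 - alpha / 2)
    return theta, np.column_stack([theta - z * se, theta + z * se])
```

The returned second value has one row per parameter, with the lower and upper interval bounds as columns, mirroring the layout of parmci.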

What I am looking for is Python code that gives me the parmci values (I can get the parmhat values with scipy.stats.genpareto.fit). I have been scouring Google and Stack Overflow for two days now, and I cannot find any approach that works for me.

While I am specifically working with the Generalized Pareto Distribution, I think this question can apply to many more (if not all) distributions that scipy.stats has.

My data: I am interested in the shape and scale parameters of the generalized Pareto fit; the location parameter should be fixed at 0 for my fit.

What I have done so far

scipy.stats While scipy.stats provides good fitting performance, it does not offer a way to calculate confidence intervals on the parameter estimates of a fitted distribution.

scipy.optimize.curve_fit As an alternative, I have seen scipy.optimize.curve_fit suggested, since it does provide the estimated covariance of the parameter estimates. However, that function fits by least squares, whereas I need MLE, and I did not see a way to make curve_fit use MLE. Therefore it seems I cannot use curve_fit.
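That said, the MLE itself can be computed without curve_fit by minimizing the negative log-likelihood directly with scipy.optimize.minimize. A minimal sketch (the name fit_gpd_mle and the starting guess are my own choices) that should essentially reproduce genpareto.fit(data, floc=0):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genpareto


def fit_gpd_mle(data):
    # MLE for (shape, scale) of a Generalized Pareto Distribution with
    # loc fixed at 0, via direct minimization of the negative log-likelihood.
    def nll(theta):
        shape, scale = theta
        if scale <= 0:
            return np.inf  # outside the parameter space
        return -genpareto.logpdf(data, shape, 0, scale).sum()

    res = minimize(nll, x0=[0.1, np.mean(data)], method='Nelder-Mead')
    return res.x  # array([shape, scale])
```

This does not by itself give confidence intervals, but it confirms that least squares is not needed anywhere in the fitting step.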

statsmodels.GenericLikelihoodModel Next I found a suggestion to use statsmodels' GenericLikelihoodModel. The original question there used a gamma distribution and asked for a non-zero location parameter. I altered the code to:

import numpy as np
from statsmodels.base.model import GenericLikelihoodModel
from scipy.stats import genpareto


# Data contains 66 experimentally obtained values
data = np.array([3.3768732 , 0.19022354, 2.5862942 , 0.27892331, 2.52901677,
       0.90682787, 0.06842895, 0.90682787, 0.85465385, 0.21899145,
       0.03701204, 0.3934396 , 0.06842895, 0.27892331, 0.03701204,
       0.03701204, 2.25411215, 3.01049545, 2.21428639, 0.6701813 ,
       0.61671203, 0.03701204, 1.66554224, 0.47953739, 0.77665706,
       2.47123239, 0.06842895, 4.62970341, 1.0827188 , 0.7512669 ,
       0.36582134, 2.13282122, 0.33655947, 3.29093622, 1.5082936 ,
       1.66554224, 1.57606579, 0.50645878, 0.0793677 , 1.10646119,
       0.85465385, 0.00534871, 0.47953739, 2.1937636 , 1.48512994,
       0.27892331, 0.82967374, 0.58905024, 0.06842895, 0.61671203,
       0.724393  , 0.33655947, 0.06842895, 0.30709881, 0.58905024,
       0.12900442, 1.81854273, 0.1597266 , 0.61671203, 1.39384127,
       3.27432715, 1.66554224, 0.42232511, 0.6701813 , 0.80323855,
       0.36582134])
params = genpareto.fit(data, floc=0)  # loc fixed at 0; shape and scale free
# HOW TO ESTIMATE/GET ERRORS FOR EACH PARAM?

print(params)
print('\n')


class Genpareto(GenericLikelihoodModel):

    nparams = 2

    def loglike(self, params):
        # params = (shape, loc, scale); loc is kept fixed at 0,
        # so params[1] is never used
        return genpareto.logpdf(self.endog, params[0], 0, params[2]).sum()


res = Genpareto(data).fit(start_params=params)
res.df_model = 2
res.df_resid = len(data) - res.df_model
print(res.summary())

This gives me a somewhat reasonable fit:

  • Scipy stats fit: (0.007194143471555344, 0, 1.005020562073944)
  • Genpareto fit: (0.00716650293, 8.47750397e-05, 1.00504535)

However in the end I get an error when it tries to calculate the covariance:

HessianInversionWarning: Inverting hessian failed, no bse or cov_params available

If I do return genpareto.logpdf(self.endog, *params).sum() I get a worse fit compared to scipy stats.

Bootstrapping Lastly I found mentions of bootstrapping. While I roughly understand the idea behind it, I have no clue how to implement it. What I understand is that you should resample N times (1000 for example) from your data set (66 points in my case), do a fit on each sub-sample, and record the fit results. Then you do a statistical analysis on the N results, i.e. calculate the mean and standard deviation and from those a confidence interval, as in Estimate confidence intervals for parameters of distribution in python or Compute a confidence interval from sample data assuming unknown distribution. I even found some old MATLAB documentation on the calculations behind gpfit explaining this.

However I need my code to run fast, and I am not sure if any implementation that I make will do this calculation fast.
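The resampling scheme described above can be sketched as a percentile bootstrap (the name bootstrap_gpd_ci and the defaults for n_boot/alpha/seed are my own choices). Each iteration is just one genpareto.fit call, so on the order of 1000 resamples of a small data set should still run in seconds:

```python
import numpy as np
from scipy.stats import genpareto


def bootstrap_gpd_ci(data, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap for (shape, scale) of a Generalized Pareto
    # Distribution with loc fixed at 0: refit on resampled data and take
    # empirical quantiles of the resulting estimates.
    rng = np.random.default_rng(seed)
    fits = np.empty((n_boot, 2))
    for i in range(n_boot):
        sample = rng.choice(data, size=len(data), replace=True)
        shape, _, scale = genpareto.fit(sample, floc=0)
        fits[i] = shape, scale
    lo, hi = np.percentile(fits, [100 * alpha / 2, 100 * (1 - alpha / 2)],
                           axis=0)
    return np.column_stack([lo, hi])  # rows: shape, scale; cols: lower, upper
```

Note this takes the interval directly from the empirical quantiles rather than from mean ± std_dev; both variants appear in the links above.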

Conclusions Does anyone know of a Python function that calculates this in an efficient manner, or can anyone point me to a topic where this has already been explained in a way that works for my case?

tobyvd

1 Answer


I had the same issue with GenericLikelihoodModel, and I came across this post (https://pystatsmodels.narkive.com/9ndGFxYe/mle-error-warning-inverting-hessian-failed-maybe-i-cant-use-matrix-containers), which suggests using different starting parameter values to get a result with an invertible (positive-definite) Hessian. That solved my problem.
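For what it's worth: in the question's code the middle (location) entry of params never affects loglike, which by itself leaves the Hessian singular no matter where the optimization starts. A sketch of one way around that, keeping only (shape, scale) as free parameters (the class name Genpareto2 and the synthetic data are mine):

```python
import numpy as np
from scipy.stats import genpareto
from statsmodels.base.model import GenericLikelihoodModel


class Genpareto2(GenericLikelihoodModel):

    nparams = 2

    def loglike(self, params):
        # params = (shape, scale); loc is hard-coded to 0, so every free
        # parameter moves the likelihood and the Hessian stays full rank.
        shape, scale = params
        if scale <= 0:
            return -np.inf
        return genpareto.logpdf(self.endog, shape, 0, scale).sum()


data = genpareto.rvs(0.1, loc=0, scale=1.0, size=200, random_state=1)
shape0, _, scale0 = genpareto.fit(data, floc=0)
res = Genpareto2(data).fit(start_params=[shape0, scale0], disp=0)
print(res.bse)         # standard errors for (shape, scale)
print(res.conf_int())  # 95% Wald confidence intervals
```

Seeding the optimizer with the scipy fit, as in the question, also tends to keep the optimization well inside the parameter space.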