
I want to find the distribution that best fits some data. This would typically be some sort of measurement data, for instance force or torque.

Ideally I want to run the Anderson-Darling test with multiple distributions and select the distribution with the highest p-value. This would be similar to the 'Goodness of fit' test in Minitab. I am having trouble finding a Python implementation of Anderson-Darling that calculates the p-value.

I have tried scipy's stats.anderson(), but it only returns the AD statistic and a list of critical values with the corresponding significance levels, not the p-value itself.
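To illustrate what I mean, here is a minimal sketch of the scipy call (the data here are synthetic stand-ins for my measurements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=5.0, size=200)  # stand-in for measurement data

res = stats.anderson(data, dist='norm')
print(res.statistic)           # the A^2 statistic
print(res.critical_values)     # critical values ...
print(res.significance_level)  # ... at these significance levels (in %)

# At best this lets you bracket the p-value: if the statistic is below the
# critical value for the 15% level, all you learn is that p > 0.15.
```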

I have also looked into statsmodels, but it seems to only support the normal distribution. I need to compare the fit of several distributions (normal, Weibull, lognormal, etc.).

Is there an implementation of the Anderson-Darling test in Python that returns the p-value and supports non-normal distributions?

3 Answers


I would just rank distributions by the goodness-of-fit statistic and not by p-values. We can use the Anderson-Darling, Kolmogorov-Smirnov, or a similar statistic just as a distance measure to rank how well different distributions fit.
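This ranking idea can be sketched with scipy alone. The helper below computes the standard A² sum formula with MLE-fitted parameters; the candidate list and synthetic data are illustrative choices:

```python
import numpy as np
from scipy import stats

def ad_statistic(x, dist):
    """Anderson-Darling statistic of `x` against `dist`, parameters fitted by MLE."""
    x = np.sort(np.asarray(x))
    n = len(x)
    params = dist.fit(x)
    # Clip to avoid log(0) at the extreme order statistics.
    F = np.clip(dist.cdf(x, *params), 1e-12, 1 - 1e-12)
    i = np.arange(1, n + 1)
    return -n - np.sum((2 * i - 1) * (np.log(F) + np.log1p(-F[::-1]))) / n

rng = np.random.default_rng(0)
data = rng.weibull(1.5, size=200) * 10.0  # skewed stand-in for measurement data

candidates = {"normal": stats.norm, "lognormal": stats.lognorm,
              "Weibull": stats.weibull_min}
# Smaller statistic = smaller distance between fitted and empirical distribution.
ranking = sorted((ad_statistic(d, dist), d_name)
                 for d_name, dist in candidates.items()
                 for d in [data])
for a2, name in ranking:
    print(f"{name}: A2 = {a2:.3f}")
```

Keep in mind the caveat discussed in the comments below: the statistic is a distance measure, not a calibrated p-value.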

Background:

p-values for Anderson-Darling or Kolmogorov-Smirnov depend on whether the parameters are estimated or not. In both cases the distribution of the test statistic is not a standard distribution.

In some cases we can tabulate, or use a functional approximation to, tabulated values. This is the case when parameters are not estimated, or when the distribution is a simple location-scale family without shape parameters.

For distributions that have a shape parameter, the distribution of the test statistic that we need for computing the p-values depends on the parameters. That is, we would have to compute different distributions or tabulate p-values for each set of parameters, which is not feasible in general. The only way to get p-values in those cases is either by bootstrap or by simulating the test statistic for the specific parameters.
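A parametric bootstrap along those lines can be sketched as follows (the helper names and the replication count are illustrative choices, not an established API):

```python
import numpy as np
from scipy import stats

def ad_stat(x, dist):
    """Anderson-Darling statistic with parameters fitted by MLE."""
    x = np.sort(np.asarray(x))
    n = len(x)
    F = np.clip(dist.cdf(x, *dist.fit(x)), 1e-12, 1 - 1e-12)
    i = np.arange(1, n + 1)
    return -n - np.sum((2 * i - 1) * (np.log(F) + np.log1p(-F[::-1]))) / n

def ad_pvalue(x, dist, n_boot=500, seed=0):
    """Parametric bootstrap: refit on each simulated sample so the null
    distribution of the statistic reflects the parameter estimation."""
    rng = np.random.default_rng(seed)
    observed = ad_stat(x, dist)
    params = dist.fit(x)
    boot = [ad_stat(dist.rvs(*params, size=len(x), random_state=rng), dist)
            for _ in range(n_boot)]
    return float(np.mean(np.asarray(boot) >= observed))

data = np.random.default_rng(1).normal(10.0, 2.0, size=100)
print(ad_pvalue(data, stats.norm, n_boot=200))
```

The returned value is the fraction of simulated statistics at least as extreme as the observed one; more replications give a smoother estimate.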

The technical condition is whether the test statistic is asymptotically pivotal, which means that the asymptotic distribution of the test statistic is independent of the specific parameters.

Using the chi-square test on binned data requires fewer assumptions, and we can compute it even when parameters are estimated. (Strictly speaking, this is only true if the parameters are estimated by MLE using the binned data.)
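A sketch of that chi-square approach with scipy (the bin count and renormalization are illustrative choices; note the degrees-of-freedom adjustment for the two estimated parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(10.0, 2.0, size=500)  # stand-in measurement data

edges = np.linspace(data.min(), data.max(), 11)  # 10 equal-width bins
observed, _ = np.histogram(data, bins=edges)

# Caveat from above: strictly, the parameters should be fitted by MLE on the
# binned counts; here they are fitted on the raw data for simplicity.
mu, sigma = stats.norm.fit(data)
cdf = stats.norm.cdf(edges, mu, sigma)
expected = len(data) * np.diff(cdf) / (cdf[-1] - cdf[0])  # renormalize to the binned range

# ddof=2: two estimated parameters reduce the chi-square degrees of freedom.
chi2, p = stats.chisquare(observed, expected, ddof=2)
print(chi2, p)
```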

Josef
  • Thank you for your answer. You propose using the test statistic because it is a measure of the fit. However, I have noticed that _Minitab_ specifically warns against using this statistic to determine the best fit: _"However, avoid directly comparing AD values across different distributions when the AD values are close, because AD statistics are distributed differently for different distributions. To better compare the fit of different distributions, use additional criteria, such as the probability plots, the p-values, and your process knowledge."_ They propose the p-value as a better measure of fit. – Christian Erichsen Jun 12 '18 at 19:16
  • Here is the link to the Minitab documentation: [http://support.minitab.com/en-us/minitab/18/help-and-how-to/quality-and-process-improvement/quality-tools/how-to/individual-distribution-identification/interpret-the-results/all-statistics-and-graphs/goodness-of-fit/](http://support.minitab.com/en-us/minitab/18/help-and-how-to/quality-and-process-improvement/quality-tools/how-to/individual-distribution-identification/interpret-the-results/all-statistics-and-graphs/goodness-of-fit/) – Christian Erichsen Jun 12 '18 at 19:21
  • 2
    That Minitab comment doesn't make much sense to me, and I have no idea how they compute p-values for distributions with shape parameters, unless they use simulated values or restrict to distributions without shape parameters. AD and KS and similar GOF statistics are just distance measures between hypothesized and empirical distributions. The smaller the test statistic, the closer is the distribution to the data in the given definition of the distance measure. – Josef Jun 13 '18 at 00:01
  • 2
    Using probability plots as additional aid is always useful because it provides additional information where the distribution might fit well or not well. If we use the p-values for the case when the parameters are not estimated, then they will not be correct in the case when the parameters are estimated. – Josef Jun 13 '18 at 00:01

You can check this page, based on the OpenTURNS library. Basically, if x is a Python list or a NumPy array,

import openturns as ot
sample = ot.Sample(x)

then call the Anderson-Darling method:

test_result = ot.NormalityTest.AndersonDarlingNormal(sample)

The p-value is obtained by calling test_result.getPValue()

Jean A.

You could use multiple distributions; the dist argument just needs to be callable. See below how I called gamma.

from statsmodels.stats.diagnostic import anderson_statistic as ad_stat
from scipy import stats

# `df` is a pandas DataFrame with a 'Total' column of measurement values
result = ad_stat(df[['Total']], dist=stats.gamma)

You could call any distribution listed in Scipy: https://docs.scipy.org/doc/scipy/reference/stats.html

See source code for more info: https://www.statsmodels.org/stable/_modules/statsmodels/stats/_adnorm.html
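Building on that, here is a hedged, self-contained sketch of comparing several scipy distributions with `anderson_statistic` (the synthetic data and candidate list are illustrative; smaller statistic means closer fit, with the caveats from the accepted answer about comparing AD values across distributions):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import anderson_statistic

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=300)  # skewed synthetic data

# fit=True makes statsmodels call dist.fit(data) before evaluating the CDF.
for d in (stats.gamma, stats.lognorm, stats.norm):
    a2 = anderson_statistic(data, dist=d, fit=True)
    print(f"{d.name}: {a2:.3f}")
```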