How to select the best-fit continuous distribution from two goodness-of-fit tests?

Question

I looked at the question "Best fit Distribution plots" and found that the submitted answers use the Kolmogorov-Smirnov test to find the best-fit distribution. I also found that the Anderson-Darling test is used for the same purpose. So, I have a few questions:

Question 1:

If I have data and pass it through NumPy's histogram function, what parameters should I use, and which output should I pass on to the distribution?

import numpy as np

def get_hist(data, data_size):
    #### General code:
    bins_formulas = ['auto', 'fd', 'scott', 'rice', 'sturges', 'doane', 'sqrt']
    # bins = np.histogram_bin_edges(a=data, bins='scott')
    # bins = np.histogram_bin_edges(a=data, bins='auto')
    bins = np.histogram_bin_edges(a=data, bins='fd')
    # print('Bin value = ', bins)

    # Obtaining the histogram of data (earlier attempts are kept as comments):
    # Hist, bin_edges = np.histogram(a=data, bins=bins, range=np.linspace(start=np.min(data),end=np.max(data),size=data_size), density=True)
    # Hist, bin_edges = np.histogram(a=data, range=np.linspace(np.min(data), np.max(data), data_size), density=True)
    # Hist, bin_edges = np.histogram(a=data, bins=bins, density=True)
    # Hist, bin_edges = np.histogram(a=data, bins=bins, range=(min(data), max(data)), normed=True, density=True)
    # Hist, bin_edges = np.histogram(a=data, density=True)
    Hist, bin_edges = np.histogram(a=data, range=(min(data), max(data)), density=True)
    return Hist
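
For reference, a minimal sketch of how the histogram relates to the fit, assuming the data is later passed to a SciPy distribution's `fit` method: the distribution is fitted to the raw sample, and the density-normalized histogram only serves as a visual check. The sample, the seed and the choice of `gamma` are illustrative assumptions, not part of the original code.

```python
import numpy as np
import scipy.stats as st

# Illustrative sample (assumption): 500 draws from a gamma distribution.
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=1.5, size=500)

# Density-normalized histogram with Freedman-Diaconis bin edges.
bin_edges = np.histogram_bin_edges(data, bins="fd")
hist, bin_edges = np.histogram(data, bins=bin_edges, density=True)

# The fit itself uses the raw data, not the histogram.
# floc=0 pins the location parameter, a common choice for positive data.
shape, loc, scale = st.gamma.fit(data, floc=0)

# Compare the histogram heights with the fitted density at the bin centres.
centres = 0.5 * (bin_edges[:-1] + bin_edges[1:])
fitted_pdf = st.gamma.pdf(centres, shape, loc=loc, scale=scale)
max_gap = np.max(np.abs(hist - fitted_pdf))
print("bins:", len(hist), "largest gap between histogram and pdf:", max_gap)
```

With `density=True` the bar areas sum to one, so the bar heights are directly comparable to a probability density.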

Question 2:

If I want to combine both tests, how can I do that? Which parameters are best for finding the best-fit distribution? Here is my attempt at combining both tests.

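import scipy.stats as st
from statsmodels.stats.diagnostic import anderson_statistic as adtest

def get_best_distribution(data):
    # Note: some names below ('frechet_r', 'frechet_l', 'gilbrat') are not available
    # in recent SciPy releases; drop or rename them if getattr fails.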
    dist_names = ['alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat',  'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'moyal', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy']
    dist_ks_results = []
    dist_ad_results = []
    params = {}
    for dist_name in dist_names:
        dist = getattr(st, dist_name)
        param = dist.fit(data)
        params[dist_name] = param

        # Applying the Kolmogorov-Smirnov test
        D_ks, p_ks = st.kstest(data, dist_name, args=param)
        print("Kolmogorov-Smirnov test Statistics value for " + dist_name + " = " + str(D_ks))
        # print("p value for " + dist_name + " = " + str(p_ks))
        dist_ks_results.append((dist_name, p_ks))

        # Applying the Anderson-Darling test:
        D_ad = adtest(x=data, dist=dist, fit=False, params=param)
        print("Anderson-Darling test Statistics value for " + dist_name + " = " + str(D_ad))
        dist_ad_results.append((dist_name, D_ad))

    print(dist_ks_results)
    print(dist_ad_results)

    # Keep the last distribution that passes both thresholds.
    # Note: the KS entries are p-values, the AD entries are test statistics.
    best_ks_D = best_ad_D = best_ks_dist = best_ad_dist = None
    for D in range(len(dist_ks_results)):
        KS_D = dist_ks_results[D][1]
        AD_D = dist_ad_results[D][1]
        if KS_D < 0.25 and AD_D < 0.05:
            best_ks_D = KS_D
            best_ad_D = AD_D
            best_ks_dist = dist_ks_results[D][0]
            best_ad_dist = dist_ad_results[D][0]

    if best_ks_dist is None:
        print("No distribution passed both thresholds.")
        return None

    print('\n################################ Kolmogorov-Smirnov test parameters #####################################')
    print("Best fitting distribution (KS test): " + str(best_ks_dist))
    print("Best test Statistics value (KS test): " + str(best_ks_D))
    print("Parameters for the best fit (KS test): " + str(params[best_ks_dist]))
    print('################################################################################\n')
    print('################################ Anderson-Darling test parameters #########################################')
    print("Best fitting distribution (AD test): " + str(best_ad_dist))
    print("Best test Statistics value (AD test): " + str(best_ad_D))
    print("Parameters for the best fit (AD test): " + str(params[best_ad_dist]))
    print('################################################################################\n')

Question 3:

How can I obtain the p-value for the Anderson-Darling test?

Question 4:

Say that I managed to get the best-fit distribution. How can I rank the distributions based on the tests, as in the image below?

[Image: goodness-of-fit tests with ranking]
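
One possible way to build such a ranking is sketched below, under stated assumptions: each candidate is fitted with SciPy, the Kolmogorov-Smirnov statistic comes from `scipy.stats.kstest`, and the Anderson-Darling statistic is computed by the hypothetical helper `ad_statistic` from the standard A² formula. The candidate list, the sample and the use of the sum of the two ranks as a combined order are illustrative choices, not an established rule.

```python
import numpy as np
import scipy.stats as st

def ad_statistic(data, dist, params):
    """Anderson-Darling A^2 for a fully specified continuous distribution."""
    z = np.sort(dist.cdf(data, *params))
    z = np.clip(z, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    n = z.size
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(z) + np.log1p(-z[::-1])))

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=2.0, size=300)  # illustrative sample

rows = []
for name in ["norm", "logistic", "laplace", "gumbel_r", "cauchy"]:
    dist = getattr(st, name)
    params = dist.fit(data)
    ks_stat = st.kstest(data, name, args=params).statistic
    rows.append((name, ks_stat, ad_statistic(data, dist, params)))

# Rank 1 = smallest statistic = closest fit under that criterion.
ks_order = [r[0] for r in sorted(rows, key=lambda r: r[1])]
ad_order = [r[0] for r in sorted(rows, key=lambda r: r[2])]
# Combined order (assumption): sort by the sum of the two ranks.
table = sorted(rows, key=lambda r: ks_order.index(r[0]) + ad_order.index(r[0]))

print(f"{'distribution':<12} {'KS D':>8} {'KS rank':>8} {'AD A2':>10} {'AD rank':>8}")
for name, ks_stat, ad_stat in table:
    print(f"{name:<12} {ks_stat:>8.4f} {ks_order.index(name) + 1:>8} "
          f"{ad_stat:>10.4f} {ad_order.index(name) + 1:>8}")
```

Ranking on the statistics rather than on p-values avoids ties when several p-values round to the same number, and the statistics are comparable across candidates because they are all computed on the same sample.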

Edit 1

I am not sure, but is normal_ad from statsmodels a general Anderson-Darling test for any continuous probability distribution? If it is, I would like to select the distribution that is common to both tests. If I follow the same steps as in question 1, will that be the right approach? Also, if I want to find the distribution with the highest p-value that is common to both tests, how can I extract its name together with the p-values?

import scipy.stats as st
# Assumed import: the text refers to normal_ad from statsmodels.
from statsmodels.stats.diagnostic import normal_ad as adnormtest

def get_best_distribution(data):
    dist_names = ['beta', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'erlang', 'expon', 'f', 'fatiguelife', 'fisk', 'gamma', 'genlogistic', 'genpareto', 'invgauss', 'johnsonsb', 'johnsonsu', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'maxwell', 'mielke', 'norm', 'pareto', 'reciprocal', 'rayleigh', 't', 'triang', 'uniform', 'weibull_min', 'weibull_max']
    dist_ks_results = []
    dist_ad_results = []
    params = {}
    for dist_name in dist_names:
        dist = getattr(st, dist_name)
        param = dist.fit(data)
        params[dist_name] = param

        # Applying the Kolmogorov-Smirnov test
        D_ks, p_ks = st.kstest(data, dist_name, args=param)
        print("Kolmogorov-Smirnov test Statistics value for " + dist_name + " = " + str(D_ks))
        print("p value (KS test) for " + dist_name + " = " + str(p_ks))
        dist_ks_results.append((dist_name, p_ks))

        # Applying the Anderson-Darling test.
        # normal_ad tests normality only, so this result does not depend on dist_name.
        D_ad, p_ad = adnormtest(x=data, axis=0)
        print("Anderson-Darling test Statistics value for " + dist_name + " = " + str(D_ad))
        print("p value (AD test) for " + dist_name + " = " + str(p_ad))
        dist_ad_results.append((dist_name, p_ad))

    # select the best fitted distribution:
    best_ks_dist, best_ks_p = (max(dist_ks_results, key=lambda item: item[1]))
    best_ad_dist, best_ad_p = (max(dist_ad_results, key=lambda item: item[1]))

    print('\n################################ Kolmogorov-Smirnov test parameters #####################################')
    print("Best fitting distribution (KS test) :" + str(best_ks_dist))
    print("Best p value (KS test) :" + str(best_ks_p))
    print("Parameters for the best fit (KS test) :" + str(params[best_ks_dist]))
    print('###########################################################################################################\n')
    print('################################ Anderson-Darling test parameters #########################################')
    print("Best fitting distribution (AD test) :" + str(best_ad_dist))
    print("Best p value (AD test) :" + str(best_ad_p))
    print("Parameters for the best fit (AD test) :" + str(params[best_ad_dist]))
    print('###########################################################################################################\n')
    if best_ks_dist == best_ad_dist:
        best_common_dist = best_ks_dist
        print('##################################### Both test parameters ############################################')
        print("Best fitting distribution (Both test) :" + str(best_common_dist))
        print("Best p value (KS test) :" + str(best_ks_p))
        print("Best p value (AD test) :" + str(best_ad_p))
        print("Parameters for the best fit (Both test) :" + str(params[best_common_dist]))
        print('###########################################################################################################\n')
        return best_common_dist, best_ks_p, params[best_common_dist]

Question 5:

Correct me if I am wrong: when implementing a goodness-of-fit test, the p-value obtained is used to check whether the given values fit any of the mentioned distributions. So, the maximum p-value means that the p-value lies below the 5% significance level, and therefore, for example, the gamma distribution fits the data. Am I right, or did I misunderstand the main concept of the goodness-of-fit test?
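
As a point of reference for this question, the usual decision rule can be sketched as follows. The null hypothesis H0 is that the data follow the hypothesized distribution; a p-value below the significance level (here 5%) leads to rejecting H0, while a larger p-value only means the data are not inconsistent with H0. The sample and the two hypothesized distributions are illustrative assumptions, and both are fully specified in advance rather than fitted.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(2)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # illustrative sample
alpha = 0.05  # significance level

# Both hypothesized distributions are fully specified in advance
# (standard normal and standard exponential; nothing is estimated).
results = {}
for name in ["norm", "expon"]:
    stat, p_value = st.kstest(data, name)
    results[name] = (stat, p_value)
    decision = "reject" if p_value < alpha else "fail to reject"
    print(f"H0: data ~ {name:<5}  D = {stat:.3f}  p = {p_value:.3g}  -> {decision} H0")
```

A large p-value is therefore not proof that a distribution fits; it only indicates that the test found no evidence against it at the chosen level.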

score 2 · Accepted Answer · answered Apr 30 '20 at 20:51

Question 3 is easy to solve with OpenTURNS. I generally rank distributions with the Bayesian information criterion (BIC), because it penalizes the number of parameters: at comparable fit, distributions with fewer parameters rank higher.

In the following example, I create a Gaussian distribution and generate a sample from it. Then I compute the BIC scores with the FittingTest.BIC function for the 30 distributions in the library. Finally, I use the np.argsort function to get the sorted indices and print the results.

import openturns as ot
import numpy as np
# Generate a sample
distribution = ot.Normal()
sample = distribution.getSample(100)
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
nbmax = len(tested_factories)
# Compute BIC scores
bic_scores = []
names = []
for i in range(nbmax):
    factory = tested_factories[i]
    names.append(factory.getImplementation().getClassName())
    try:
        fitted_dist, bic = ot.FittingTest.BIC(sample, factory)
    except Exception:
        # The factory cannot be fitted to this sample: rank it last.
        bic = np.inf
    bic_scores.append(bic)
# Sort the scores
indices = np.argsort(bic_scores)
# Print result: factory name, rank, BIC score
for i in range(nbmax):
    print(names[indices[i]], ": ", i, bic_scores[indices[i]])

This produces:

NormalFactory :  0 2.902476153791324
TruncatedNormalFactory :  1 2.9391403094910493
LogisticFactory :  2 2.945101831314491
LogNormalFactory :  3 2.948479498106734
StudentFactory :  4 2.9487326727806438
WeibullMaxFactory :  5 2.9506160993704653
WeibullMinFactory :  6 2.9646030668970464
TriangularFactory :  7 2.9683050343363897
TrapezoidalFactory :  8 2.970676202179786
BetaFactory :  9 3.033244379700322
RayleighFactory :  10 3.0511170157342207
LaplaceFactory :  11 3.0641174552986796
FrechetFactory :  12 3.1472260896504327
UniformFactory :  13 3.1551588725784927
GumbelFactory :  14 3.1928562445001263
HistogramFactory :  15 3.3881831435932748
GammaFactory :  16 3.3925823197940552
ExponentialFactory :  17 3.824030948338899
ArcsineFactory :  18 214.7536151046246
ChiFactory :  19 680.8835152447839
ChiSquareFactory :  20 683.6769102883109
FisherSnedecorFactory :  21 inf
LogUniformFactory :  22 inf
GeneralizedParetoFactory :  23 inf
RiceFactory :  24 inf
DirichletFactory :  25 inf
BurrFactory :  26 inf
InverseNormalFactory :  27 inf
MeixnerDistributionFactory :  28 inf
ParetoFactory :  29 inf

Some distributions cannot be fitted to this sample. For these, the fit raises an exception, which I catch with a try/except, and I set the BIC to infinity so that they rank last.
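
The same ranking idea can be sketched with SciPy alone, for readers who fit with `scipy.stats` as in the question. This is a sketch under assumptions: the helper `bic_score` is hypothetical and uses the classical definition BIC = k ln(n) - 2 ln(L), whereas the OpenTURNS scores above appear to be on a per-observation scale, so the numbers are not directly comparable. The candidate list and the sample are illustrative.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # illustrative sample

# Candidates with two parameters (loc, scale) and one with three ('t' adds df).
candidates = ["norm", "logistic", "laplace", "cauchy", "t"]

def bic_score(data, dist_name):
    """Classical BIC of a maximum-likelihood fit: k*ln(n) - 2*ln(L)."""
    dist = getattr(st, dist_name)
    params = dist.fit(data)
    log_lik = np.sum(dist.logpdf(data, *params))
    k = len(params)  # number of fitted parameters
    return k * np.log(len(data)) - 2.0 * log_lik

# Lower BIC is better, so an ascending sort gives the ranking.
ranking = sorted((bic_score(data, name), name) for name in candidates)
for rank, (bic, name) in enumerate(ranking):
    print(rank, name, round(bic, 2))
```

Because every extra parameter adds ln(n) to the score, a three-parameter candidate has to improve the log-likelihood noticeably to outrank a two-parameter one.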

Thank you for your answer, I will give it a try. I am just stuck with question 1, 2 and edit 1. Have you been through this situation? — WDpad159, May 01 '20 at 01:33
Sorry, but I have no good idea on question 1... except that I would rather not do it personally. — Michael Baudin, May 01 '20 at 06:44
Also, is it common for some distributions to have the same p-value? If it is, what can I do to get unique values for each distribution? — WDpad159, May 01 '20 at 13:26
Combining two methods to define a new one is not unusual. But the Kolmogorov-Smirnov and Anderson-Darling tests have different goals, so I cannot find a way to merge them. By the way, I think your way of using the Kolmogorov-Smirnov test is wrong: since the parameters are estimated from the data, you should use the Lilliefors test, as shown in https://stackoverflow.com/questions/57354430/goodness-of-fit-test-for-weibull-distribution-in-python/59096874#59096874 — Michael Baudin, May 01 '20 at 16:16
I have a question: say that I used SciPy for the KS test and got the test statistic and p-value. Correct me if I am wrong: when implementing a goodness-of-fit test, the p-value obtained is used to check whether the given values fit any of the mentioned distributions. So, the maximum p-value means that the p-value lies below the 5% significance level, and therefore, for example, the gamma distribution fits the data. Am I right, or did I misunderstand the main concept of the goodness-of-fit test? — WDpad159, May 14 '20 at 18:32
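
The remark about the Lilliefors test can be illustrated with a small simulation, sketched here with SciPy only. Normal samples are tested against the normal distribution twice: once with parameters fixed in advance, and once with parameters fitted to the same sample, as in the question's code. The number of replications, the sample size and the seed are illustrative assumptions.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
n_rep, n_obs, alpha = 400, 50, 0.05

reject_true = 0    # H0 parameters known in advance
reject_fitted = 0  # H0 parameters estimated from the same sample

for _ in range(n_rep):
    sample = rng.normal(loc=0.0, scale=1.0, size=n_obs)
    # Parameters fixed beforehand: the standard KS p-value is valid.
    if st.kstest(sample, "norm", args=(0.0, 1.0)).pvalue < alpha:
        reject_true += 1
    # Parameters fitted to the sample: the fit adapts to the data, so the
    # standard KS p-value comes out too large.
    if st.kstest(sample, "norm", args=st.norm.fit(sample)).pvalue < alpha:
        reject_fitted += 1

print("rejection rate, known parameters :", reject_true / n_rep)
print("rejection rate, fitted parameters:", reject_fitted / n_rep)
```

Since H0 is true in every replication, a well-calibrated test should reject in roughly 5% of them. A much lower rate in the fitted case indicates that the p-values are inflated, which is the issue a Lilliefors-type correction addresses.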

score 0 · Answer 2 · answered May 01 '20 at 06:43

Question 2 can be solved with the NormalityTest.AndersonDarlingNormal function:

import openturns as ot
distribution = ot.Normal()
sample = distribution.getSample(100)
test_result = ot.NormalityTest.AndersonDarlingNormal(sample)
print(test_result.getPValue())

This prints:

0.8267360272974381

The API is documented in the help page of the function, which includes an example, and the theory is documented here.
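
For readers staying with SciPy, a comparable p-value for the normal case can be approximated by simulation. This is a sketch under assumptions: it relies on `scipy.stats.anderson`, which in the SciPy versions this sketch was written against reports the statistic and critical values rather than a p-value, and the helper `anderson_normal_pvalue`, the number of simulations and the samples are hypothetical choices.

```python
import numpy as np
import scipy.stats as st

def anderson_normal_pvalue(data, n_sim=500, seed=0):
    """Monte Carlo p-value for the Anderson-Darling normality test."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    observed = st.anderson(data, dist="norm").statistic
    # With mean and standard deviation estimated from the sample, the null
    # distribution of the statistic does not depend on the true mean or
    # scale, so standard normal samples are enough.
    simulated = np.array([
        st.anderson(rng.standard_normal(data.size), dist="norm").statistic
        for _ in range(n_sim)
    ])
    # Share of simulated statistics at least as large as the observed one.
    p_value = (1 + np.sum(simulated >= observed)) / (n_sim + 1)
    return observed, p_value

rng = np.random.default_rng(4)
stat_n, p_n = anderson_normal_pvalue(rng.normal(size=100))
stat_e, p_e = anderson_normal_pvalue(rng.exponential(size=100))
print("normal sample     :", stat_n, p_n)
print("exponential sample:", stat_e, p_e)
```

The resolution of the p-value is limited by `n_sim`: with 500 simulations the smallest reportable value is 1/501.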
