I found an explanation of how to fit distributions to data. I need to compare my sample with each of the candidate distributions using the Kolmogorov-Smirnov test, but I do not know how to interpret the results and choose the best distribution based on this test. The code below does not implement the Kolmogorov-Smirnov test. So:

1 - How do I implement the Kolmogorov-Smirnov test?
2 - How do I choose the best distribution?

import warnings

import numpy as np
import pandas as pd
import scipy.stats as st


def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [st.alpha, st.anglit]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    runs = []
    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)
                print(params)
                # Separate parts of parameters
                arg = params[:-2]
                print(arg)
                loc = params[-2]
                print(loc)
                scale = params[-1]
                print(scale)

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if axis pass in add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                runs.append([distribution.name, sse])
                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse

        except Exception:
            pass
    print(runs)
    return (best_distribution.name, best_params)
dina
  • Firstly, thank you very much for your help. Secondly, I found the code here: https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python, which is why I am asking on the same forum. Also, I need to compute this test with Python. – dina Jan 28 '19 at 20:19
  • @pjs I edited my question. I am sorry, I wasn't careful. – dina Jan 28 '19 at 20:54

1 Answer

First of all, note that the code snippet you provided does not include a Kolmogorov-Smirnov test. Instead, it fits the parameters of each distribution by maximum likelihood (distribution.fit) and then chooses the best fit by the sum of squared errors between the histogram and the fitted PDF.

To answer your first question, here is an example of a Kolmogorov-Smirnov goodness-of-fit test for the normal distribution in scipy.stats:

stats.kstest(samples, 'norm', args=(0, 1))

where

  • samples - the collected/observed experimental data
  • 'norm' - the predefined name of the theoretical continuous distribution
  • args - the parameters of the theoretical distribution, in the example mean=0 and std=1
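
As a minimal runnable sketch of the call above: the sample here is simulated from a standard normal with a fixed seed purely for illustration (the seed and sample size are assumptions, not part of the question's data).

```python
import numpy as np
from scipy import stats

# Illustrative data only: 500 draws from a standard normal distribution.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)

# One-sample K-S test against N(0, 1); args are passed to the 'norm' CDF
# as (loc, scale), i.e. mean and standard deviation.
result = stats.kstest(samples, 'norm', args=(0, 1))

print("D =", result.statistic)
print("p-value =", result.pvalue)
```

With real data, replace samples with your own one-dimensional array.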

So, to test other distributions, you just need to iterate over the names of the required theoretical distributions and their parameters, in the same way as for the normal distribution in the example above.
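
One possible way to write that loop is sketched below. The helper name ks_for_distributions and the candidate list are hypothetical, and the parameters are estimated with distribution.fit as in the question's code. Keep in mind that when the parameters are estimated from the same sample that is being tested, the K-S p-value tends to be too optimistic, so treat it as a rough guide.

```python
import warnings

import numpy as np
from scipy import stats


def ks_for_distributions(data, dist_names):
    """Fit each named scipy.stats distribution and run a K-S test against it.

    Hypothetical helper: returns a list of (name, params, D, p_value) tuples.
    """
    results = []
    for name in dist_names:
        dist = getattr(stats, name)
        # Ignore warnings from distributions that fit the data poorly.
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            params = dist.fit(data)  # (shape params..., loc, scale)
        d, p = stats.kstest(data, name, args=params)
        results.append((name, params, d, p))
    return results


# Illustrative data only: gamma-distributed sample with a fixed seed.
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=1.5, size=400)

for name, params, d, p in ks_for_distributions(data, ['norm', 'gamma', 'expon']):
    print(name, "D =", round(d, 4), "p =", round(p, 4))
```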

The stats.kstest function returns two values:

  • D - the K-S statistic
  • p-value - a p-value for the null hypothesis that the samples were drawn from the provided theoretical distribution

So, to answer your second question: reject the null hypothesis for a distribution if its p-value is less than your chosen significance level. Among the distributions for which the null hypothesis cannot be rejected, compare the D values and choose the distribution with the smallest D, since D measures the goodness of fit: the smaller the value of D, the better the distribution fits the data.
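
The selection rule just described could be sketched as follows. The function name choose_best, the 0.05 significance level, and the candidate parameters are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def choose_best(results, alpha=0.05):
    """Pick the (name, D, p_value) tuple with the smallest D among the
    distributions whose null hypothesis is not rejected at level alpha.

    Hypothetical helper: returns None if every candidate is rejected.
    """
    candidates = [r for r in results if r[2] >= alpha]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r[1])


# Illustrative data only: simulated standard normal sample.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=300)

# Hypothetical candidates with fixed (loc, scale) parameters.
candidates = [('norm', (0, 1)), ('logistic', (0, 0.55)), ('uniform', (-3, 6))]

results = []
for name, params in candidates:
    d, p = stats.kstest(samples, name, args=params)
    results.append((name, d, p))

print(results)
print("best:", choose_best(results))
```

If choose_best returns None, none of the candidate distributions is compatible with the data at the chosen significance level.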

Michael Glazunov
  • Thank you very much. This code does not include the Kolmogorov-Smirnov test; I use it as a reference since it contains a dataset and is easy to change. – dina Jan 28 '19 at 21:33