30

I have a data set on N numbers that I want to test for normality. I know scipy.stats has a kstest function but there are no examples on how to use it and how to interpret the results. Is anyone here familiar with it that can give me some advice?

According to the documentation, using kstest returns two numbers, the KS test statistic D and the p-value. If the p-value is greater than the significance level (say 5%), then we cannot reject the hypothesis that the data come from the given distribution.

When I do a test run by drawing 10000 samples from a normal distribution and testing for gaussianity:

import numpy as np
from scipy.stats import kstest

mu,sigma = 0.07, 0.89
kstest(np.random.normal(mu,sigma,10000),'norm')

I get the following output:

(0.04957880905196102, 8.9249710700788814e-22)

The p-value is less than 5% which means that we can reject the hypothesis that the data are normally distributed. But the samples were drawn from a normal distribution!

Can someone understand and explain to me the discrepancy here?

(Does testing for normality assume mu = 0 and sigma = 1? If so, how can I test that my data are gaussianly distributed but with a different mu and sigma?)

sophros
  • 14,672
  • 11
  • 46
  • 75
Hooloovoo
  • 865
  • 2
  • 11
  • 21

4 Answers4

26

Your data was generated with mu=0.07 and sigma=0.89. You are testing this data against a normal distribution with mean 0 and standard deviation of 1.

The null hypothesis (H0) is that the distribution of which your data is a sample is equal to the standard normal distribution with mean 0, std deviation 1.

The small p-value is indicating that a test statistic as large as D would be expected with probability p-value.

In other words, (with p-value ~8.9e-22) it is highly unlikely that H0 is true.

That is reasonable, since the means and std deviations don't match.

Compare your result with:

In [22]: import numpy as np
In [23]: import scipy.stats as stats
In [24]: stats.kstest(np.random.normal(0,1,10000),'norm')
Out[24]: (0.007038739782416259, 0.70477679457831155)

To test your data is gaussian, you could shift and rescale it so it is normal with mean 0 and std deviation 1:

data=np.random.normal(mu,sigma,10000)
normed_data=(data-mu)/sigma
print(stats.kstest(normed_data,'norm'))
# (0.0085805670733036798, 0.45316245879609179)

Warning: (many thanks to user333700 (aka scipy developer Josef Perktold)) If you don't know mu and sigma, estimating the parameters makes the p-value invalid:

import numpy as np
import scipy.stats as stats

mu = 0.3
sigma = 5

num_tests = 10**5
num_rejects = 0
alpha = 0.05
for i in xrange(num_tests):
    data = np.random.normal(mu, sigma, 10000)
    # normed_data = (data - mu) / sigma    # this is okay
    # 4915/100000 = 0.05 rejects at rejection level 0.05 (as expected)
    normed_data = (data - data.mean()) / data.std()    # this is NOT okay
    # 20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
    D, pval = stats.kstest(normed_data, 'norm')
    if pval < alpha:
        num_rejects += 1
ratio = float(num_rejects) / num_tests
print('{}/{} = {:.2f} rejects at rejection level {}'.format(
    num_rejects, num_tests, ratio, alpha))     

prints

20/100000 = 0.00 rejects at rejection level 0.05 (not expected)

which shows that stats.kstest may not reject the expected number of null hypotheses if the sample is normalized using the sample's mean and standard deviation

normed_data = (data - data.mean()) / data.std()    # this is NOT okay
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Hi unutbu, that makes sense, thanks! How can I test whether my data is gaussianly distributed with a different mu and sigma value as given in the example? If I can define a gaussian as g = lambda x: ( 1.0/(np.sqrt(2.0*np.pi*sigma**2)) )* exp(-((x-mu)**2/(2.0*sigma**2) ) ) , can i then run the kstest as kstest(data,g) ? – Hooloovoo Oct 26 '11 at 15:07
  • You can supply a callable as the second argument to `kstest`, but the callable must represent the cumulative distribution function, not the probability density function. I think in this case, it would be easier to renormalize your data (as shown above). – unutbu Oct 26 '11 at 16:38
  • 8
    Kolmogorov-Smirnov test assumes that the parameters are not estimated. If you normalize or use estimated parameters, then the pvalues of kstest are not correct. If you just want to test whether the data is normal distributed for whatever mu and sigma are, then I would recommend other tests. – Josef Oct 27 '11 at 05:13
  • hi, which tests are there for just testing? – user528025 Sep 17 '12 at 12:47
  • @user528025: Hi, I do not know exactly what you mean. If you are asking a statistics question, you may be better of posting a question at [Cross Validated](http://stats.stackexchange.com/). – unutbu Sep 17 '12 at 16:28
  • Thanks. I wish I could upvote this more than once :) Really helpful. – Filipe Correia Feb 14 '13 at 01:11
  • i'm confused a bit - how can you *know* mu and sigma in the real world? you can only estimate them with some confidence. so then you're saying the 2-pair ks test not valid for real-world data? how do you prove to yourself that you are allowed to use it? i have two real-world samples and i want ask if they came from the same underlying distribution (unknown to me). – user391339 Sep 17 '14 at 02:47
  • @user391339: This questions was about the use of `stats.kstest`. I think the problem you are contemplating could be solved using [stats.ks_2samp](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ks_2samp.html). I'm not an expert at stats though, so you may want to search the questions or ask a new question at [Cross Validated](http://stats.stackexchange.com/). – unutbu Sep 17 '14 at 13:45
  • @user333700 Do you know which test(s) could be used to provide correct p-values, when using estimated parameters from real data (with unknown distribution)? – Julia Jun 27 '17 at 15:13
  • @Julia What do you mean with unknown distribution? The hypothesis test is for a specific distribution or distribution family. If you fit many different distributions, then ks statistic or similar can be used to rank them by goodness of fit, but that wouldn't be a hypothesis test. – Josef Jun 27 '17 at 15:20
  • @user333700 What I mean is that I have data sample and I want to find out which distribution describes this data sample best. I use fit function to estimate parameters for all continuous distributions from `scipy.stats` and then I use kstest (for all continuous distributions), providing it data sample, distribution name and estimated parameters to see if distributions are the same. – Julia Jun 27 '17 at 15:51
  • @Julia For that purpose I used the ks statistic and the chisquare test with binned data for the ranking and pick the highest, or more standard distribution if the top ranking distributions are close to each other. The p-values will not be correct as a test, but even if they would be, it is often the case that several distributions are not rejected in small samples and all distributions are rejected in very large samples. What's a "good enough" match depends on the purpose and usage. – Josef Jun 27 '17 at 16:06
  • @user333700 I have run `kstest` for small (1000) and large (1000000) samples and all distributions are rejected for both cases. I will look into the chisquare test. – Julia Jun 28 '17 at 07:22
  • you can plot the histogram with overlayed estimated pdf for the top distribution to check why they are rejected and whether the rejection means much. "small" is often below 20 or 100 observations. At 1 Million observations even tiny differences in the distribution will cause a rejection. – Josef Jun 28 '17 at 09:33
  • @user333700 Could you please explain why p-value of kstest is not valid for estimated parameters? Here is what I did: `In: x = stats.triang(1, 1, 2).rvs(size=1000)`, `In: x_fit = stats.triang.fit(x)`, `In: print "estimated parameters ", x_fit` -> `Out: estimated parameters (0.99777265575186269, 0.99386709068107804, 2.0064586950934324)`. Now with kstest `In: print "triang ", stats.kstest(x, 'triang', args=x_fit)`-> `Out: triang KstestResult(statistic=0.027106168438533224, pvalue=0.45188497115983406)`. This seems to work with estimated parameters and rather large data sample. – Julia Jun 29 '17 at 07:45
13

An update on unutbu's answer:

For distributions that only depend on the location and scale but do not have a shape parameter, the distributions of several goodness-of-fit test statistics are independent of the location and scale values. The distribution is non-standard, however, it can be tabulated and used with any location and scale of the underlying distribution.

The Kolmogorov-Smirnov test for the normal distribution with estimated location and scale is also called the Lilliefors test.

It is now available in statsmodels, with approximate p-values for the relevant decision range.

>>> import numpy as np
>>> mu,sigma = 0.07, 0.89
>>> x = np.random.normal(mu, sigma, 10000)
>>> import statsmodels.api as sm
>>> sm.stats.lilliefors(x)
(0.0055267411213540951, 0.66190841161592895)

Most Monte Carlo studies show that the Anderson-Darling test is more powerful than the Kolmogorov-Smirnov test. It is available in scipy.stats with critical values, and in statsmodels with approximate p-values:

>>> sm.stats.normal_ad(x)
(0.23016468240712129, 0.80657628536145665)

Neither of the test rejects the Null hypothesis that the sample is normal distributed. While the kstest in the question rejects the Null hypothesis that the sample is standard normal distributed.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
Josef
  • 21,998
  • 3
  • 54
  • 67
  • Anderson (or Shapiro or D'Agostino) all fail with large sample sizes though – Oli Dec 03 '20 at 16:14
  • What does it mean they fail? You can ask a separate question with the details of the problems. – Josef Dec 03 '20 at 22:19
3

You may also want to consider using the Shapiro-Wilk test, which "tests the null hypothesis that the data was drawn from a normal distribution." It's also implemented in scipy:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html

You'll need to pass your data directly into the function.

import scipy

W, p = scipy.stats.shapiro(dataset)
print("Shapiro-Wilk test statistic, W:", W, "\n", "p-value:", p)

Which returns something like:

 Shapiro-Wilk test statistic, W: 0.7761164903640747 
 p-value: 6.317247641091492e-37

With p << 0.01 (or 0.05, if you prefer - it doesn't matter,) we have good reason to reject the null hypothesis that these data were drawn from a normal distribution.

Halyn Betchkal
  • 243
  • 5
  • 15
  • 1
    The linked documentation states that for N > 5000, the p-value is not guaranteed to be correct. The original question used 10000 as an example. – kjyv Apr 27 '16 at 13:44
1

As a complement to the answer by @unutbu , you could also provide the distribution parameters for the test distribution in kstest. Suppose that we had some samples from a variable (and named them datax), and we wanted to check if those samples could possibly not come from a lognormal, a uniform, or a normal. Note that for scipy stats the way the input parameters are taken for each distribution varies a bit. Now, thanks to "args" (tuple or sequence) in kstest, is possible provide the arguments for the scipy.stats distribution you want to test against.

:) I also added the option of using a two-sample test, in case you wanted to do it either way:

import numpy as np
from math import sqrt
from scipy.stats import kstest, ks_2samp, lognorm
import scipy.stats

def KSSeveralDists(data,dists_and_args,samplesFromDists=100,twosampleKS=True):
    returnable={}
    for dist in dists_and_args:
        try:
            if twosampleKS:
                try:
                    loc=dists_and_args[dist][0]
                    scale=dists_and_args[dist][1]
                    expression='scipy.stats.'+dist+'.rvs(loc=loc,scale=scale,size=samplesFromDists)'
                    sampledDist=eval(expression)
                except:
                    sc=dists_and_args[dist][0]
                    loc=dists_and_args[dist][1]
                    scale=dists_and_args[dist][2]
                    expression='scipy.stats.'+dist+'.rvs(sc,loc=loc,scale=scale,size=samplesFromDists)'
                    sampledDist=eval(expression)
                D,p=ks_2samp(data,sampledDist)
            else:
                D,p=kstest(data,dist,N=samplesFromDists,args=dists_and_args[dist])
        except:
            continue
        returnable[dist]={'KS':D,'p-value':p}
    return returnable

a=lambda m,std: m-std*sqrt(12.)/2.
b=lambda m,std: m+std*sqrt(12.)/2.
sz=2000

sc=0.5 #shape 
datax=lognorm.rvs(sc,loc=0.,scale=1.,size=sz)
normalargs=(datax.mean(),datax.std())

#suppose these are the parameters you wanted to pass for each distribution
dists_and_args={'norm':normalargs,
               'uniform':(a(*normalargs),b(*normalargs)),
               'lognorm':[0.5,0.,1.]
              }
print "two sample KS:"
print KSSeveralDists(datax,dists_and_args,samplesFromDists=sz,twosampleKS=True)
print "one sample KS:"
print KSSeveralDists(datax,dists_and_args,samplesFromDists=sz,twosampleKS=False)

which gives as an output something like:

two sample KS: {'lognorm': {'KS': 0.023499999999999965, 'p-value': 0.63384188886455217}, 'norm': {'KS': 0.10600000000000004, 'p-value': 2.918766666723155e-10}, 'uniform': {'KS': 0.15300000000000002, 'p-value': 6.443660021191129e-21}}

one sample KS: {'lognorm': {'KS': 0.01763415915126032, 'p-value': 0.56275820961065193}, 'norm': {'KS': 0.10792612430093562, 'p-value': 0.0}, 'uniform': {'KS': 0.14910036159697559, 'p-value': 0.0}}

Note: For the scipy.stats uniform distribution, a and b are taken as a=loc and b=loc + scale (see documentation).