
I've read through existing posts about this module (and the Scipy docs), but it's still not clear to me how to use Scipy's kstest module to do a goodness-of-fit test when you have a data set and a callable function.

The PDF I want to test my data against isn't one of the standard scipy.stats distributions, so I can't just call it using something like:

kstest(mydata,'norm')

where mydata is a Numpy array. Instead, I want to do something like:

kstest(mydata,myfunc)

where 'myfunc' is the callable function. This doesn't work, which is unsurprising: there's no way for kstest to know what the abscissa for the 'mydata' array is, so it can't generate the corresponding theoretical frequencies using 'myfunc'. Suppose the frequencies in 'mydata' correspond to the values of the random variable stored in the array 'abscissa'. Then I thought maybe I could use stats.ks_2samp:

ks_2samp(mydata,myfunc(abscissa))

but I don't know if that's statistically valid. (Sidenote: do kstest and ks_2samp expect frequency arrays to be normalized to one, or do they want the absolute frequencies?)

In any case, since the one-sample KS test is supposed to be used for goodness-of-fit testing, I have to assume there's some way to do it with kstest directly. How do you do this?

2 Answers


Some examples may shed some light on how to use scipy.stats.kstest. Let's first set up some test data, e.g. normally distributed with mean 5 and standard deviation 10:

>>> data = scipy.stats.norm.rvs(loc=5, scale=10, size=(1000,))

To run kstest on these data we need a function f(x) that takes an array of quantiles and returns the corresponding values of the cumulative distribution function. If we reuse the cdf function of scipy.stats.norm, we could do:

>>> scipy.stats.kstest(data, lambda x: scipy.stats.norm.cdf(x, loc=5, scale=10))
(0.019340993719575206, 0.84853828416694665)

The same test is more conveniently written as:

>>> scipy.stats.kstest(data, 'norm', args=(5, 10))
(0.019340993719575206, 0.84853828416694665)

If we have uniformly distributed data, it is easy to build the cdf by hand:

>>> data = np.random.rand(1000)
>>> scipy.stats.kstest(data, lambda x: x)
(0.019145675289412523, 0.85699937276355065)
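The OP's actual case, a custom PDF rather than a named distribution, can be handled the same way: numerically integrate the PDF into a CDF callable and hand that to kstest. A minimal sketch, where the triangular PDF `my_pdf` and all other names are illustrative assumptions, not from the question:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Hypothetical custom PDF: a triangular density on [0, 2] with its peak at 1.
def my_pdf(x):
    return np.where(x < 1, x, 2 - x)

# Build a CDF callable by numerically integrating the PDF from the
# lower bound of its support up to each quantile.
def my_cdf(x):
    x = np.atleast_1d(x)
    return np.array([quad(my_pdf, 0.0, xi)[0]
                     for xi in np.clip(x, 0.0, 2.0)])

# Draw a sample from the same triangular distribution to test against.
rng = np.random.default_rng(0)
sample = stats.triang.rvs(c=0.5, loc=0, scale=2, size=500, random_state=rng)

# kstest accepts any callable CDF as its second argument.
d, p = stats.kstest(sample, my_cdf)
print(d, p)  # model matches the data, so d should be small and p large
```

The per-point quad loop is slow for large samples; tabulating the CDF once and interpolating would be faster, but the loop keeps the sketch short.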
Jaime
  • Thank you, works great now! Something is confusing me though. When I follow your example, I get D = 0.08, p = 1.6e-14. In my original question, I mentioned my 'hack' solution for using ks_2samp instead: I used the histogram module to compute the observed frequencies of the data, computed the theoretical frequencies for the same bin sizes, and used ks_2samp on these two arrays. That gave me D = 0.74, p = 0.017. It seems a bit strange to me that this would give such a drastically different result. Do you think the two calculations should be closer? –  Jul 28 '13 at 04:25
  • Hang on, I may have confused myself: does ks_2samp take the empirical cdf of the two data sets, or the two data sets themselves? –  Jul 28 '13 at 04:34
  • `ks_2samp` takes the two data sets themselves. If you are doing things properly, I think it seems reasonable that your `ks_2samp` method would yield higher `p-values` than `kstest`; not sure if the difference you are seeing is too large or not... – Jaime Jul 28 '13 at 04:42
  • Got it now. Using the correct inputs, I can make the p values from kstest and ks_2samp converge by taking a large enough sample from the theoretical distribution. Thanks for your help! I wish I could vote your answer up, but that will have to wait until I have enough rep to do it. –  Jul 28 '13 at 05:18

As for ks_2samp, it tests the null hypothesis that both samples are drawn from the same probability distribution.

You can, for example, do:

>>> from scipy.stats import ks_2samp
>>> import numpy as np

where x and y are two instances of numpy.ndarray:

>>> ks_2samp(x, y)
(0.022999999999999909, 0.95189016804849658)

The first value is the test statistic, and the second value is the p-value. If the p-value is above your significance level (e.g. 0.05 for a 5% level of significance), you cannot reject the null hypothesis that the two samples are drawn from the same distribution; if it falls below that level, you reject it.
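To make that interpretation concrete, here is a self-contained sketch (the distributions, sample sizes, and seed are made up for illustration) comparing a matched pair of samples with a mismatched one:

```python
import numpy as np
from scipy.stats import ks_2samp, norm

rng = np.random.default_rng(42)

# Two samples from the same distribution: the null hypothesis is true,
# so the p-value is typically large.
a = norm.rvs(loc=0, scale=1, size=1000, random_state=rng)
b = norm.rvs(loc=0, scale=1, size=1000, random_state=rng)
stat_same, p_same = ks_2samp(a, b)

# A sample from a clearly shifted distribution: the p-value is
# essentially zero, so we reject the null hypothesis.
c = norm.rvs(loc=3, scale=1, size=1000, random_state=rng)
stat_diff, p_diff = ks_2samp(a, c)

print(p_same, p_diff)
```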

Keller Scholl
kiriloff
  • From ks_2samp documentation: If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same. – Volodimir Kopey Mar 26 '18 at 17:28