I find that the shape of the random sample has a big impact on the result of kstest. I tried the following code:

import numpy as np
from scipy import stats

N = 260
np.random.seed(1)
X = np.random.rand(N)
Xarray = X.reshape(N,1)
XarrayT = Xarray.T

print('X' + str(X.shape) + ': ' + str(stats.kstest(X, 'uniform') ) )
print( 'Xarray' + str(Xarray.shape) + ':' + str( stats.kstest(Xarray, 'uniform') ) )
print( 'XarrayT' + str(XarrayT.shape) + ': ' + str( stats.kstest(XarrayT, 'uniform') ) )

It gives the results:

X(260,): KstestResult(statistic=0.052396054203786291, pvalue=0.46346349447418866)
Xarray(260, 1):KstestResult(statistic=0.99988562518265511, pvalue=0.0)
XarrayT(1, 260): KstestResult(statistic=0.99988562518265511, pvalue=0.00022874963468977327)

Here X, Xarray, and XarrayT hold the same data and differ only in shape, yet the p-values are completely different. Is this a bug, or am I missing something about how to use kstest correctly?

Thanks!

Xihao Li

1 Answer

The SciPy kstest documentation says that the sample should be a 1-D array.

If we run the following:

print('X ' + 'ndimensions=' + str(X.ndim) + ' ' + (str(stats.kstest(X, 'uniform'))))

we see that the input array has 1 dimension.

Output:

X ndimensions=1 KstestResult(statistic=0.052396054203786291, pvalue=0.46346349447418866)

However, when we try the reshaped Xarray:

print('Xarray ' + 'ndimensions=' + str(Xarray.ndim) + ' ' + (str(stats.kstest(Xarray, 'uniform'))))

Xarray ndimensions=2 KstestResult(statistic=0.99988562518265511, pvalue=0.0)

This suggests that passing a two-dimensional array is what breaks the Kolmogorov-Smirnov goodness-of-fit test here: the function expects a 1-D sample, and the (260, 1) and (1, 260) inputs violate that expectation.
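One plausible explanation for the 0.9998... statistic is NumPy broadcasting. The sketch below is not SciPy's code; it assumes the statistic is computed by subtracting the sample's CDF values from a 1-D array of empirical CDF steps, which is an assumption I have not verified against the SciPy version used in the question. Under that assumption, a (260, 1) input broadcasts to a (260, 260) difference matrix whose maximum is 1 minus the smallest sample, which reproduces the reported number:

```python
import numpy as np

N = 260
np.random.seed(1)
X = np.random.rand(N)

# Empirical CDF steps 1/N, 2/N, ..., 1 as a 1-D array of shape (N,).
steps = np.arange(1.0, N + 1) / N

# For uniform(0, 1) the CDF of a value in [0, 1] is the value itself,
# so the column vector below stands in for the CDF values.
col = X.reshape(N, 1)

# (N,) minus (N, 1) broadcasts to (N, N): every step is compared with
# every sample instead of only with the matching sorted sample.
diff = steps - col
d_plus_col = diff.max()

print(diff.shape)                  # (260, 260)
print(d_plus_col, 1.0 - X.min())   # both are 1 minus the smallest sample
```

If this reading is right, the 2-D result is not a property of the data at all, just the largest step (1.0) minus the smallest value in the sample.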

I would also suggest reading the answers at this Stack Overflow question.

Dylan
  • Thanks Dylan! I see the 1-D array requirement in the kstest documentation, and I am aware that operations such as ".T" on a 1-D array do not give the result one might expect. My concern is that a 1-D array works well for kstest but poorly for operations such as ".T", which makes it hard to use one consistent array shape throughout an application. – Xihao Li Oct 27 '17 at 14:16
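On the concern in the comment about ".T": one possible pattern, offered as a sketch rather than a recommendation from the SciPy documentation, is to keep whichever 2-D shape the rest of the application needs and flatten only at the call to kstest. The variable names below are illustrative; np.ravel, np.newaxis, and stats.kstest are existing NumPy and SciPy calls.

```python
import numpy as np
from scipy import stats

N = 260
np.random.seed(1)
X = np.random.rand(N)

# .T is a no-op on a 1-D array: the shape stays (260,).
print(X.T.shape)

# Make an explicit column vector when 2-D semantics are needed.
col = X[:, np.newaxis]   # shape (260, 1)
row = col.T              # shape (1, 260)

# Flatten back to 1-D only where kstest is called.
res_1d = stats.kstest(X, 'uniform')
res_col = stats.kstest(np.ravel(col), 'uniform')
res_row = stats.kstest(np.ravel(row), 'uniform')

print(res_1d.statistic, res_col.statistic, res_row.statistic)
```

All three calls see the same 1-D sample, so they should report the same statistic and p-value regardless of how the data is stored elsewhere.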