
I guess this is a simple question, but I can't sort it out. I have a vector, the first elements of which look like:

V = [31 52 38 29 29 34 29 24 25 25 32 28 24 28 29 ...];

and I want to perform a chi2gof test in Matlab to test if V is exponentially distributed. I did:

[h,p] = chi2gof(V,'cdf',@expcdf);

but I get a warning message saying:

Warning: After pooling, some bins still have low expected counts.
The chi-square approximation may not be accurate

Have I defined the chi2gof call incorrectly?

horchler
Oliver Amundsen

1 Answer


At 36 values, you have a very small sample set. From the second sentence of Wikipedia's article on the chi-squared test (emphasis added):

It is suitable for unpaired data from large samples.

Large in this case usually means at least about 100. The test rests on other assumptions as well, which are worth reading up on.
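To see how thin the bins get at this sample size, you can request the third output of chi2gof and compare observed with expected counts. This is only a sketch: x is a synthetic stand-in for V (the full vector isn't shown in the question), the mean of 30 is an arbitrary choice, and the threshold of 5 is chi2gof's default for its 'EMin' option.

```matlab
rng(0);                          % fixed seed, only so the run is repeatable
x = exprnd(30,1,36);             % hypothetical sample: 36 exponential values

% Third output holds the bin edges plus observed (O) and expected (E) counts
[h,p,stats] = chi2gof(x,'cdf',{@expcdf,expfit(x)},'nparams',1);

disp([stats.O(:) stats.E(:)])    % observed vs. expected count per pooled bin
nLow = sum(stats.E < 5);         % bins still under the default EMin of 5
fprintf('%d of %d bins have expected count < 5\n',nLow,numel(stats.E));
```

If several bins remain below the threshold after pooling, that is the situation the warning describes.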


Alternatives

You might try kstest in Matlab, which is based on the Kolmogorov-Smirnov test:

[h,p] = kstest(V,'cdf',[V(:) expcdf(V(:),expfit(V))])

Or try lillietest, which is based on the Lilliefors test and has an option specifically for exponentially distributed data:

[h,p] = lillietest(V,'Distribution','exp')
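A quick way to get a feel for both alternatives is to run them side by side on data you know to be exponential. The sketch below assumes a synthetic sample x (mean 30, chosen arbitrarily) in place of V; the two calls are the same ones shown above.

```matlab
rng(0);                                   % fixed seed for repeatability
x = exprnd(30,1,36);                      % hypothetical exponential sample

% Kolmogorov-Smirnov against an exponential cdf with the fitted mean,
% supplied as a two-column matrix [values, cdf at those values]
[hKS,pKS] = kstest(x,'cdf',[x(:) expcdf(x(:),expfit(x))]);

% Lilliefors test for the exponential family
[hL,pL] = lillietest(x,'Distribution','exp');

fprintf('kstest:     h = %d, p = %.3f\n',hKS,pKS);
fprintf('lillietest: h = %d, p = %.3f\n',hL,pL);
```

lillietest may warn that p lies outside its tabulated range, in which case it returns the nearest tabulated value.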

In case you can increase your sample size, you are doing one thing wrong with chi2gof. From the help for the 'cdf' option:

A fully specified cumulative distribution function. This can be a ProbabilityDistribution object, a function handle, or a function name. The function must take X values as its only argument. Alternately, you may provide a cell array whose first element is a function name or handle, and whose later elements are parameter values, one per cell. The function must take X values as its first argument, and other parameters as later arguments.

You're not supplying any additional parameters, so expcdf is using its default mean parameter of mu = 1. Your data values are very large and don't correspond at all to an exponential distribution with this mean. You need to estimate the parameter as well. Using the expfit function, which is based on maximum likelihood estimation, you might try something like this:

[h,p] = chi2gof(V,'cdf',@(x)expcdf(x,expfit(V)),'nparams',1)
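To make the mu = 1 mismatch concrete, here is a small sketch using only the 15 values visible in the question (the rest of V is elided, so V15 is a partial stand-in, not the real data). It also shows the cell-array form of 'cdf' described in the help text. As far as I can tell, chi2gof evaluates the cdf at bin edges rather than at the data, so the mean is estimated once from the data and then passed in.

```matlab
V15 = [31 52 38 29 29 34 29 24 25 25 32 28 24 28 29];  % visible part of V only

mu = expfit(V15);               % MLE of the exponential mean (the sample mean)

% With the default mu = 1, even the smallest value sits at the far tail
pDefault = expcdf(min(V15));    % 1 - exp(-24), numerically almost 1
pFitted  = expcdf(min(V15),mu); % cdf at the same point using the fitted mean

fprintf('mu = %.4f, cdf(min) default = %.6f, fitted = %.4f\n', ...
        mu,pDefault,pFitted);

% Cell-array form: function handle first, then its parameter values
[h,p] = chi2gof(V15,'cdf',{@expcdf,mu},'nparams',1);
```

With only 15 values the chi2gof result itself means little; the point is the form of the call and the gap between the two cdf values.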

However, with only 36 samples you may not get a very good estimate for a distribution like this and still may not get expected results even for data sampled from a known distribution, e.g.:

V = exprnd(10,1,36);
[h,p] = chi2gof(V,'cdf',@(x)expcdf(x,expfit(V)),'nparams',1)
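One way to check how the test behaves at this sample size is to repeat the experiment many times on data that really is exponential and count the rejections. This is a rough sketch: the number of repetitions (500), the 5% level, and the blanket warning suppression are all choices made for illustration.

```matlab
rng(1);                              % fixed seed for repeatability
nRep = 500; n = 36; mu = 10;         % illustrative settings
pvals = nan(nRep,1);

ws = warning('off','all');           % silence the low-count warnings in the loop
for k = 1:nRep
    x = exprnd(mu,1,n);              % sample from a known exponential
    [~,pvals(k)] = chi2gof(x,'cdf',{@expcdf,expfit(x)},'nparams',1);
end
warning(ws);                         % restore the previous warning state

% Fraction of runs rejecting at the 5% level (NaN p-values count as no rejection)
rate = mean(pvals < 0.05);
fprintf('Rejection rate at n = %d: %.3f\n',n,rate);
```

A rate far from the nominal 0.05, or many NaN p-values, would suggest the chi-square approximation is unreliable at n = 36.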
horchler
  • Fantastic explanation. Thanks so much. Could you please suggest additional literature for backing up that a large sample for Chi2 is about 100? – Oliver Amundsen Dec 03 '14 at 11:21
  • Moreover, what if the Lilliefors test accepts the null hypothesis and KS rejects it? That seems to be what is happening to me, unless I made a mistake in the commands. – Oliver Amundsen Dec 03 '14 at 11:26
  • That is a well-known property of the chi-squared test and will be found in any good text. 100 is a rule of thumb. The point is that you need lots of samples for the test to work well, and you need a few in each bin/cell the test divides them into. I've updated my answer with an additional resource. Keep in mind that StackOverflow is geared toward programming, not Math/Statistics per se. If you have a question about which test to use in which situation, it would probably be better suited for [Cross Validated](http://stats.stackexchange.com) or [Math.StackExchange](http://math.stackexchange.com). – horchler Dec 03 '14 at 17:51