3

I'm trying to perform Pearson Chi-square test on goodness of fit. Here is an example of fitting a Poisson distribution:

data <- rpois(200,50)
estimate <- mean(data)
freq.os<-table(data)
yfit <- dpois(as.integer(names(freq.os)), estimate)

chisq.test(x = freq.os, p = yfit)
# Error in chisq.test(x = freq.os, p = yfit) : probabilities must sum to 1.

When I evaluate sum(yfit), I get 0.999839.

So how to produce a set of probability values that add up to 1? Thank you!


EDIT

Actually I found a work around:

chisq.test(freq.os, yfit)

but I am very confused as to how chisq.test() works as it is telling me df = 429. I thought df = n - k - 1, which in this case should be 35, where k = 1 for the lambda and n = number of terms in the chi-square sum. Where am I doing wrong?

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
Jerry W
  • 185
  • 1
  • 2
  • 10

1 Answers1

6

Comments above suggested you to manually rescale yfit, before passing it to chisq.test. Well actually you can let chisq.test() do this for you:

chisq.test(x = freq.os, p = yfit, rescale.p = TRUE)

Regarding your Edit

chisq.test(freq.os, yfit)

is incorrect, as it is doing independence test.

chisq.test() can perform two statistical test:

  1. goodness of fit test, using argument x and p;
  2. independence test, using argument x and y.

Please read ?chisq.test carefully. For goodness of fit test, you must use p argument in the function, as you initially did. My answer above, using rescale.p = TRUE will help you get around the error you saw.


How to do Pearson Chi-square test

You said you don't know how the test is done, then read this part.

You should use set.seed(), so that when others run your code, they get the same random numbers as you got. Here is a reproducible example:

N <- 200    ## number of samples
set.seed(0)    ## fix random seed for reproducibility
x <- rpois(N,50)    ## generate sample
lambda <- mean(x)    ## estimate mean

Now we aim to use Pearson Chi-square test to test goodness of fit. We have

Null Hypothesis: x ~ Poisson(lambda)

O <- table(x)    ## contingency table / observed frequency
n <- length(O)    ## number of categories
# 31
x0 <- as.integer(names(O))    ## unique sample values
p <- dpois(x0, lambda); p <- p / sum(p)    ## theoretical probability under Null Hypothesis

Let's first perform Chi-square test ourselves.

E <- p * N    ## expected frequency under Null Hypothesis
R <- (O - E) / sqrt(E)    ## standardised residuals
X <- sum(R * R)    ## Chi-square statistic
# 36.13962
dof <- n - 1    ## degree of freedom
# 30
p.value <- 1 - pchisq(X, df = dof)    ## p-value
# 0.2035416

The p-value is larger than 0.05, hence we can not reject Null Hypothesis. Summary plot:

z <- curve(dchisq(x, df = dof), from = 0, to = 100)
abline(v = X, lty = 3)
polygon(c(X, X, z$x[z$x > X]), c(0, dchisq(X, df = dof), z$y[z$x > X]), col = 2)

enter image description here

Then we use chisq.test()

chisq.test(O, p = p)

#Chi-squared test for given probabilities

#data:  O
#X-squared = 36.14, df = 30, p-value = 0.2035

You can see, the result is the same.

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248