Chi Square Test of Independence of Whole Dataset

Question

I have a 3185x90 dataset of binary values and want to do a chi-squared test of independence, comparing all column variables against each other.

I've tried different variations of code from Google searches, using chisq.test() and some for loops, but none of them have worked so far.

How do I do this?

This is the frame I've tinkered with. My dataset is oak.

chi_trial <- data.frame(a = c(0,1), b = c(0,1))
for(row in 1:nrow(oak)){
  print(row)
  print(chisq.test(c(oak[row,1],d[row,2])))
}

I also tried this:

apply(d, 1, chisq.test)

which gives me the error: Error in FUN(newX[, i], ...) : all entries of 'x' must be nonnegative and finite


dput(oak[1:2],)
structure(list(post_flu = structure(c(1, 1, 1, 1, 1, 0, 0, 0, 
0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 
0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,

label = "Receipt of Flu Vaccine - Encounter Survey", format.stata = "%10.0g")), row.names = c(NA, 
-3185L), class = c("tbl_df", "tbl", "data.frame"), label = "Main Oakland Clinic Analysis Dataset")

I added a sample of my data with the final lines of the output. The portion of the dataset is small, but it all looks like this.

Hi Joseph, it's not clear what rows or columns you want to perform the `chisq.test` on. Can you please clarify? How do you intend to correct for multiple testing? Additionally, it will be much easier to help if you provide at least a sample of your data `dput(d[1:20,])`. You can [edit] your question and paste the output. Please surround the output with three backticks (```) for better formatting. See [How to make a reproducible example](https://stackoverflow.com/questions/5963269/) for more info. — Ian Campbell, Jun 14 '20 at 04:43
I second Ian's comment. Also, you say your goal is "comparing all variables", but the comparisons in your for-loop are row-wise. Data frame rows are observations, while *columns* are variables. χ2 also doesn't really make sense for observation-wise comparisons, as a rule, although I suppose there might be occasional exceptions. — , Jun 14 '20 at 05:05
+1 to both @IanCampbell (and gersht). Even if you get an answer to your specific question, that may not be doing you a service in the long run. You have deeper issues to consider before you can be sure that what you are doing is correct or appropriate. — Limey, Jun 14 '20 at 07:39
I added some data. I want to know if there are significant differences in variable/column frequencies. Lastly, I don't plan on having a career in coding; using R is required for the research I'm doing this summer. The code I use usually comes from other Stack Overflow posts, so I'm unsure which ones will work. — Joseph, Jun 14 '20 at 13:17

score 2 · Answer 1 · answered Jun 14 '20 at 05:11

You could use something like the code below, which returns a matrix of pairwise results, much like R's cor function. I don't have your data, so I'm simulating some. Note that I get one significant p-value at the traditional cut-off of 0.05, even though the columns were simulated independently.

set.seed(3)
nr=3185; nc=3

oak <- as.data.frame(matrix(sample(0:1, size=nr*nc, replace=TRUE), ncol=nc))
head(oak)  # show the first few rows rather than all 3185

mult.chi <- function(data){
  nc <- ncol(data)
  res <- matrix(0, nrow=nc, ncol=nc) # or NA
  for(i in 1:(nc-1))
    for(j in (i+1):nc)
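      # use the 'data' argument (not the global 'oak'); [[ ]] returns a plain vector, even for a tibble
      res[i,j] <- suppressWarnings(chisq.test(data[[i]], data[[j]])$p.value)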
  rownames(res) <- colnames(data)
  colnames(res) <- colnames(data)
  res
}

mult.chi(oak)

#    V1        V2         V3
# V1  0 0.7847063 0.32012466
# V2  0 0.0000000 0.01410326
# V3  0 0.0000000 0.00000000

With 90 columns you would be running choose(90, 2) = 4005 tests, so consider applying a multiple-testing adjustment, as mentioned in the comments.
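As a minimal sketch of such an adjustment, base R's p.adjust() can be applied to the vector of pairwise p-values. The three values below are copied from the simulated output above; the labels V1_V2 and so on are only illustrative names, and which method suits your study (Holm, Benjamini-Hochberg, or another) is a judgment call this sketch does not settle.

```r
# p-values copied from the 3-column simulated example above (upper triangle only);
# the element names are illustrative labels, not something mult.chi() returns
p_raw <- c(V1_V2 = 0.7847063, V1_V3 = 0.32012466, V2_V3 = 0.01410326)

# Holm controls the family-wise error rate; BH controls the false discovery rate
p_holm <- p.adjust(p_raw, method = "holm")
p_bh   <- p.adjust(p_raw, method = "BH")

# Side-by-side comparison of raw and adjusted values
round(cbind(p_raw, p_holm, p_bh), 4)

# For a full matrix such as the one returned by mult.chi(), adjust only the
# filled-in upper triangle, for example:
#   p_mat <- mult.chi(oak)
#   p_mat[upper.tri(p_mat)] <- p.adjust(p_mat[upper.tri(p_mat)], method = "holm")
```

Adjusted p-values are never smaller than the raw ones, so with thousands of pairs many nominally significant results may no longer pass the 0.05 cut-off.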

score 1 · Answer 2 · answered Jun 14 '20 at 07:24

Here is a solution that uses combn to get all combinations of column numbers taken two at a time. Tested with the data in @Edward's answer.

chisq2cols <- function(X){
  y <- matrix(0, ncol(X), ncol(X))
  cmb <- combn(ncol(X), 2)
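  # Index by (row, col) pairs; upper.tri() fills column by column, which only
  # matches combn()'s ordering when there are 3 columns or fewer
  y[t(cmb)] <- apply(cmb, 2, function(k){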
    tbl <- table(X[k])
    chisq.test(tbl)$p.value
  })
  y
}

chisq2cols(oak)
#     [,1]      [,2]       [,3]
#[1,]    0 0.7847063 0.32012466
#[2,]    0 0.0000000 0.01410326
#[3,]    0 0.0000000 0.00000000
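With 90 columns, a 90 x 90 matrix without row or column names is hard to read. One possible alternative, sketched here under assumptions, is to return one row per pair of columns, labelled with the column names. The names chisq_pairs and sim are hypothetical, the data is simulated rather than the asker's, and the sketch assumes every column takes both values 0 and 1 (a constant column would not give a 2 x 2 table).

```r
# Simulated binary data standing in for the real dataset (4 columns for brevity)
set.seed(3)
sim <- as.data.frame(matrix(sample(0:1, 3185 * 4, replace = TRUE), ncol = 4))

# Hypothetical helper: one chi-squared test per pair of columns, in long format
chisq_pairs <- function(X) {
  cmb <- combn(ncol(X), 2)                      # each column of cmb is one pair
  p <- apply(cmb, 2, function(k) {
    tbl <- table(X[[k[1]]], X[[k[2]]])          # 2 x 2 contingency table
    suppressWarnings(chisq.test(tbl)$p.value)   # Yates correction is the default for 2 x 2
  })
  data.frame(var1 = names(X)[cmb[1, ]],
             var2 = names(X)[cmb[2, ]],
             p.value = p)
}

out <- chisq_pairs(sim)
out[order(out$p.value), ]                       # smallest p-values first
```

A long table like this can be sorted or filtered directly, and its p.value column can be passed to p.adjust() if a multiple-testing correction is wanted.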
