Chi-squared test of independence on all combinations of columns in a dataframe in R

Question

this is my first time posting here and I hope this is all in the right place. I have been using R for basic statistical analysis for some time, but haven't really used it for anything computationally challenging and I'm very much a beginner in the programming/ data manipulation side of R.

I have presence/absence (binary) data on 72 plant species in 323 plots in a single catchment. The dataframe is 323 rows, each representing a plot, with 72 columns, each representing a species. This is a sample of the first 4 columns (some row numbers are missing because the 323 plots are a subset of a larger number of preassigned plots, not all of which were surveyed):

> head(plots[,1:4])
 Agrostis.canina Agrostis.capillaris Alchemilla.alpina Anthoxanthum.odoratum
1               1                   0                 0                     0
3               0                   0                 0                     0
4               0                   0                 0                     0
5               0                   0                 0                     0
6               0                   0                 0                     0
8               0                   0                 0                     0

I want to to determine whether any of the plant species in this catchment are associated with any others, and if so, whether that is a positive or negative association. To do this I want to perform a chi-squared test of independence on each combination of species. I need to create a 2x2 contingency table for each speciesxspecies comparison, run a chi-squared test on each of those contingency tables, and save the output. Ultimately I would like to end up with a list or matrix of all species by species tests that shows whether that combination of species has a positive, negative, or no significant association. I'd also like to incorporate some code that only shows an association as positive if all expected values were greater than 5.

I have made a start by writing the following function:

CHI <- function(sppx, sppy) 
{test <- chisq.test(table(sppx, sppy)) 
result <- c(test$statistic, test$p.value,
        sign((table(sppx, sppy) - test$expected)[2,2]))
return(result)
}

This returns the following:

> CHI(plots$Agrostis.canina, plots$Agrostis.capillaris)

X-squared                             
1.095869e-27  1.000000e+00 -1.000000e+00 
Warning message:
In chisq.test(chitbl) : Chi-squared approximation may be incorrect

Now I'm trying to figure out a way to apply this function to each speciesxspecies combination in the data frame. I essentially want R to take each column, apply the CHI function to that column and each other column in sequence, and so on through all the columns, subtracting each column from the dataframe as it is done so the same species pair is not tested twice. I have tried various methods trying to use "for" loops or "apply" functions, but have not been able to figure this out. I hope that is clear enough. Any help here would be much appreciated. I have tried looking for existing solutions to this specific problem online, but haven't been able to find any that really helped. If anyone could link me to an existing answer to this that would also be great.

score 2 · Answer 1 · answered May 23 '16 at 14:21

2

You need the combn function to find all the combinations of the columns and then apply them to your function, something like this:

apply(combn(1:ncol(plots), 2), 2, function(ind) CHI(plots[, ind[1]], plots[, ind[2]]))

answered May 23 '16 at 14:21

Psidom

209,562
33
339
356

Jeonifer · Answer 2 · 2016-05-24T13:22:06.667

1

I think you are looking for something like this. I used the iris dataset.

require(datasets)
ind<-combn(NCOL(iris),2)
lapply(1:NCOL(ind), function (i) CHI(iris[,ind[1,i]],iris[,ind[2,i]]))

edited May 24 '16 at 13:22

answered May 23 '16 at 14:37

Jeonifer

74
2

@Psidom you were super fast; sorry, I can't delete my answer as there is no delete button. This will teach me to refresh prior to posting. – Jeonifer May 23 '16 at 14:49
Thanks very much for your help! I tried using this code for my data: ` > plotc <- combn(NCOL(plots),2) > > lapply(1:NCOL(data), function (i) CHI(plots[,plotc[1,i]],plots[,plotc[2,i]])) [[1]] X-squared 1.095869e-27 1.000000e+00 -1.000000e+00 ` This returns a single output. I would like to output the result of each individual chi-squared test - one for each element in the combination table. Apologies, that may not have been clear in my original question. Do you know of a way to achieve this? Thanks again – YJS May 24 '16 at 11:06
You have in specified in your code to give you only one output. If you want them all delete the `[[1]]` at the end of your code and have just `plotc <- combn(NCOL(plots),2); lapply(1:NCOL(data), function (i) CHI(plots[,plotc[1,i]],plots[,plotc[2,i]]))` – Jeonifer May 24 '16 at 12:49
Sorry, that was poor formatting on my part - I thought putting " ` " on either side of my code segment would make it display as code but apparently not. The " [[1]] " is from the first line of R output from the console, not my code. I am running the code as you have just displayed it in your comment, and it is returning a list of length 1. – YJS May 24 '16 at 13:14
I see where the problem is change `NCOL(data)` to `NCOL(plotc)`. In my code I originally named the variable `data` rather than `ind` and I didn't change it in the `lapply`. I'll edit my code. – Jeonifer May 24 '16 at 13:20
Hmm... I've tried changing NCOL(data) to NCOL(plotc) and now I receive the following message: Error in (table(sppx, sppy) - test$expected)[2, 2] : subscript out of bounds Which is the same error message I got when I tried the code in Psidom's answer. I tried running the code lapply(1:NCOL(ind),... and that produced the correct values, but only creates a list of 10. I feel that's getting closer though! – YJS May 24 '16 at 13:43
See if changing the `CHI` function works. `CHI <- function(sppx, sppy) {test <- chisq.test(sppx, sppy) result <- c(test$statistic, test$p.value,sign((test$observed - test$expected)[2,2])) return(result) }` – Jeonifer May 24 '16 at 13:57
Tried changing the CHI function in the way you described, still getting a list of 10. Very strange.. – YJS May 24 '16 at 14:39
What is the `dim(plotc)` and `class(plotc)`? – Jeonifer May 24 '16 at 14:43
dim 22556, class "matrix". – YJS May 24 '16 at 16:44

score 1 · Answer 3 · answered May 14 '17 at 10:11

The below R code run chisquare test for every categorical variable / every factor of a r dataframe, against a variable given (x or y chisquare parameter is kept stable, is explicitly defined):

Define your variable Please - change df$variable1 to your desired factor variable and df to your desirable dataframe that contain all the factor variables tested against the given df$variable1

Define your Dataframe A new dataframe is created (df2) that will contain all the chi square values / dfs, p value of the given variable vs dataframe comparisons

Code created / completed/ altered from similar posts in stackoverflow, neither that produced my desired outcome. Chi-Square Tables statistic / df / p value for variable vs dataframe "2" parameter define column wide comparisons - check apply (MARGIN) option.

df2 <- t(round(cbind(apply(df, 2, function(x) {
  ch <- chisq.test(df$variable1, x)
  c(unname(ch$statistic), ch$parameter, ch$p.value )})), 3))

Chi-squared test of independence on all combinations of columns in a dataframe in R

3 Answers3

Linked