Doing chisq.test on data frame for multiple pairwise comparisons

Question

I have the following dataframe:

species <- c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e","f","f","f","g","h","h","h","i","i","i")
category <- c("h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","l","h","l","m","h","l","m")
minus <- c(31,14,260,100,70,200,91,152,842,16,25,75,60,97,300,125,80,701,104,70,7,124,24,47,251)
plus <- c(2,0,5,0,1,1,4,4,30,1,0,0,2,0,5,0,0,3,0,0,0,0,0,0,4)
df <- cbind(species, category, minus, plus)
df<-as.data.frame(df)

I want to do a chisq.test for each category-species combination, like this:

Species a, category h and l: p-value

Species a, category h and m: p-value

Species a, category l and m: p-value

Species b, ... and so on

With the following chisq.test (dummy code):

chisq.test(c(minus(cat1, cat2),plus(cat1, cat2)))$p.value

I want to end up with a table that presents each chisq.test p-value for each comparison, like this:

Species   Category1  Category2   p-value
a         h          l           0.05
a         h          m           0.2
a         l          m           0.1
b...

Where category and and category 2 are the compared categories in the chisq.test.

Is this possible to do using dplyr? I have tried tweaking what was mentioned in here and here, but they don't really apply to this issue, as I am seeing it.

EDIT: I also would like to see how this could be done for the following dataset:

species <- c(1:11)
minus <- c(132,78,254,12,45,76,89,90,100,42,120)
plus <- c(1,2,0,0,0,3,2,5,6,4,0)

I would like to do a chisq. test for each species in the table compared to every single other species in the table (a pairwise comparison between each species for all species). I want to end up with something like this:

species1  species2  p-value
1         2         0.5
1         3         0.7
1         4         0.2
...
11        10        0.02

I tried changing the code above to the following:

species_chisq %>%
do(data_frame(species1 = first(.$species),
            species2 = last(.$species),
            data = list(matrix(c(.$minus, .$plus), ncol = 2)))) %>%
mutate(chi_test = map(data, chisq.test, correct = FALSE)) %>%
mutate(p.value = map_dbl(chi_test, "p.value")) %>%
ungroup() %>%
select(species1, species2, p.value) %>%

However, this only created a table where each species was only compared to itself, and not the other species. I do not quite understand where in the original code given by @ycw it specifies which are compared.

EDIT 2:

I managed to do this by the code found here.

score 2 · Accepted Answer · answered Sep 21 '17 at 13:01

A solution from dplyr and purrr. Notice that I am not familiar with chi-square test, but I follow the way you specified in @Vincent Bonhomme's post: chisq.test(test, correct = FALSE).

In addition, to create example data frame, there is no need to use cbind, just data.frame would be sufficient. stringsAsFactors = FALSE is important to prevent columns become factor.

# Create example data frame
species <- c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e","f","f","f","g","h","h","h","i","i","i")
category <- c("h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","l","h","l","m","h","l","m")
minus <- c(31,14,260,100,70,200,91,152,842,16,25,75,60,97,300,125,80,701,104,70,7,124,24,47,251)
plus <- c(2,0,5,0,1,1,4,4,30,1,0,0,2,0,5,0,0,3,0,0,0,0,0,0,4)
df <- data.frame(species, category, minus, plus, stringsAsFactors = FALSE)

# Load packages
library(dplyr)
library(purrr)

# Process the data
df2 <- df %>%
  group_by(species) %>%
  slice(c(1, 2, 1, 3, 2, 3)) %>%
  mutate(test = rep(1:(n()/2), each = 2)) %>%
  group_by(species, test) %>%
  do(data_frame(species = first(.$species),
                test = first(.$test[1]),
                category1 = first(.$category),
                category2 = last(.$category),
                data = list(matrix(c(.$minus, .$plus), ncol = 2)))) %>%
  mutate(chi_test = map(data, chisq.test, correct = FALSE)) %>%
  mutate(p.value = map_dbl(chi_test, "p.value")) %>%
  ungroup() %>%
  select(species, category1, category2, p.value)

df2
# A tibble: 25 x 4
   species category1 category2   p.value
     <chr>     <chr>     <chr>     <dbl>
 1       a         h         l 0.3465104
 2       a         h         m 0.1354680
 3       a         l         m 0.6040227
 4       b         h         l 0.2339414
 5       b         h         m 0.4798647
 6       b         l         m 0.4399181
 7       c         h         l 0.4714005
 8       c         h         m 0.6987413
 9       c         l         m 0.5729834
10       d         h         l 0.2196806
# ... with 15 more rows

Could you elaborate on what the slice(c(1,2,1,3,2,3)) command does? I didn't understand the ?slice help. — Haakonkas, Sep 22 '17 at 11:26
I am also trying to make this work for a dataframe with only species, where I do a pair-wise comparison between each species (no categories this time). One column with species, one column with minus, and one column with plus (similar as above). If I want to compare each species with one another, and create a table of the values similarly as above, how would you change the code? As I am not quite sure how the slice() works, I find it hard to change it! — Haakonkas, Sep 22 '17 at 11:41
slice select rows based on index number. So slice(c(1, 2, 1, 3, 2, 3)) means get the first, second, first, third, second, third rows. — www, Sep 22 '17 at 12:13
Thanks! I tried changing it to only cover what I presented in the comment above, but I was not successful. Do you have a solution for this? — Haakonkas, Sep 26 '17 at 11:58

Vincent Bonhomme · Answer 2 · 2017-09-21T12:44:23.547

First, you should create your data.frame with data.frame, otherwise minus and plus columns are turned into factors.

species <- c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e","f","f","f","g","h","h","h","i","i","i")
category <- c("h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","l","h","l","m","h","l","m")
minus <- c(31,14,260,100,70,200,91,152,842,16,25,75,60,97,300,125,80,701,104,70,7,124,24,47,251)
plus <- c(2,0,5,0,1,1,4,4,30,1,0,0,2,0,5,0,0,3,0,0,0,0,0,0,4)
df <- data.frame(species=species, category=category, minus=minus, plus=plus)

Then, I'm not sure there is a pure dplyr way to do it (would be glad to be shown the contrary), but I think here is a partly-dplyr way to do it:

df_combinations <-
  # create a df with all interactions
  expand.grid(df$species, df$category, df$category)) %>% 
  # rename columns
  `colnames<-`(c("species", "category1", "category2")) %>% 
  # 3 lines below:
  # manage to only retain within a species, category(1 and 2) columns
  # with different values
  unique %>% 
  group_by(species) %>% 
  filter(category1 != category2) %>% 
  # cosmetics
  arrange(species, category1, category2) %>%
  ungroup() %>% 
  # prepare an empty column
  mutate(p.value=NA)

# now we loop to fill your result data.frame
for (i in 1:nrow(df_combinations)){
  # filter appropriate lines
  cat1 <- filter(df,
                 species==df_combinations$species[i],
                 category==df_combinations$category1[i])
  cat2 <- filter(df,
                 species==df_combinations$species[i],
                 category==df_combinations$category2[i])
  # calculate the chisq.test and assign its p-value to the right line
  df_combinations$p.value[i] <- chisq.test(c(cat1$minus, cat2$minus,
                                             cat1$plus, cat2$plus))$p.value  

}

Let's have a look to the resulting data.frame:

head(df_combinations)
# A tibble: 6 x 4
# A tibble: 6 x 4
# Groups:   species [1]
species category1 category2       p.value
<fctr>    <fctr>    <fctr>         <dbl>
1       a         h         l  3.290167e-11
2       a         h         m 1.225872e-134
3       a         l         h  3.290167e-11
4       a         l         m 5.824842e-150
5       a         m         h 1.225872e-134
6       a         m         l 5.824842e-150

Checking the first row: chisq.test(c(31, 14, 2, 0))$p.value [1] 3.290167e-11

Is this what you wanted?

Great suggestion! It is indeed what I am looking for. However, I don't seem to get the same p-values when I do a single chisq.test on each category pair separately, and almost every combination here is significant! Can yuo think of any reason for this? — Haakonkas, Sep 21 '17 at 12:27
could you give a (real) example of the chisq.test you want please? — Vincent Bonhomme, Sep 21 '17 at 12:35
don't worry, got it. cbind then as.data.frame turns numeric into factors. I modify my answer. — Vincent Bonhomme, Sep 21 '17 at 12:42
This is the way that i usually do it: test <- matrix(c(31,14,2,0),ncol = 2) chisq.test(test, correct = F)$p.value This gives me a p-value of 0.34. Is this method incorrect? — Haakonkas, Sep 21 '17 at 12:48

Doing chisq.test on data frame for multiple pairwise comparisons

2 Answers2