2

I would like to run a KW-test over certain numerical variables from a data.frame, using one grouping variable. I'd prefer to do this in a loop, instead of typing out all the tests, as they are many variables (more than in the example below).

Simulated data:

library(dplyr)
set.seed(123)
Data <- tbl_df(
data.frame(
muttype = as.factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
ados.tsc   = runif(240, 0, 10),
ados.sa    = runif(240, 0, 10),
ados.rrb   = runif(240, 0, 10))
) %>%
group_by(muttype)
ados.sim <- as.data.frame(Data)

The following code works just fine outside of the loop.

kruskal.test(formula (paste((colnames(ados.sim)[2]), "~ muttype")), data = 
ados.sim)

But it doesn't inside the loop:

for(i in names(ados.sim[,2:4])){  
ados.mtp <- kruskal.test(formula (paste((colnames(ados.sim)[i]), "~ muttype")), 
data = ados.sim)
}

I get the error:

Error in terms.formula(formula, data = data) : invalid term in model formula

Anybody who knows how to solve this?

Karolis Koncevičius
  • 9,417
  • 9
  • 56
  • 89
RmyjuloR
  • 369
  • 1
  • 4
  • 13

2 Answers2

3

Try:

results <- list()
for(i in names(ados.sim[,2:4])){  
  results[[i]] <- kruskal.test(formula(paste(i, "~ muttype")), data = ados.sim)
}

This also saves your results in a list and avoids overwriting your results as ados.mtp in every iteration, which I think is not what you intended to do.

Note the following:

for(i in names(ados.sim[,2:4])){  
   print(i)
}
[1] "ados.tsc"
[1] "ados.sa"
[1] "ados.rrb"

That is, i already gives you the name of the column. The problem in your code was that you tried to use it like an integer for subsetting, which turned the outcome into NA.

for(i in names(ados.sim[,2:4])){  
   print(paste((colnames(ados.sim)[i]), "~ muttype"))
}
[1] "NA ~ muttype"
[1] "NA ~ muttype"
[1] "NA ~ muttype"

And just for reference, all of this could also be done in the following two ways that I often prefer since it makes subsequent analysis slightly easier:

First, store all test objects in a dataframe:

library(tidyr)
df <- ados.sim %>% gather(key, value, -muttype) %>% 
      group_by(key) %>% 
      do(test = kruskal.test(x= .$value, g = .$muttype))

You can then subset the dataframe to get the test outcomes:

df[df$key == "ados.rrb",]$test
[[1]]

    Kruskal-Wallis rank sum test

data:  .$value and .$muttype
Kruskal-Wallis chi-squared = 2.2205, df = 2, p-value = 0.3295

Alternatively, get all results directly in a dataframe, without storing the test objects:

library(broom)
df2 <- ados.sim %>% gather(key, value, -muttype) %>% 
       group_by(key) %>% 
       do(tidy(kruskal.test(x= .$value, g = .$muttype)))
df2
# A tibble: 3 x 5
# Groups:   key [3]
       key statistic   p.value parameter                       method
     <chr>     <dbl>     <dbl>     <int>                       <fctr>
1 ados.rrb 2.2205031 0.3294761         2 Kruskal-Wallis rank sum test
2  ados.sa 0.1319554 0.9361517         2 Kruskal-Wallis rank sum test
3 ados.tsc 0.3618102 0.8345146         2 Kruskal-Wallis rank sum test
coffeinjunky
  • 11,254
  • 39
  • 57
  • Many thanks! All options work and especially the last option is very neat! – RmyjuloR May 06 '18 at 19:34
  • 1
    Yeah, I think the combination of `dplyr::group_by` and `broom::tidy` is extremely handy when doing this sort of thing. – coffeinjunky May 06 '18 at 19:36
  • One additional question: In the case I have a very extensive dataframe, and I would want to take specific columns (not all except for "muttype"), how would I specify that in the latter two examples? – RmyjuloR May 06 '18 at 19:46
  • Have a look at `?gather`, which is the function that reshapes your dataframe. Alternatively, just add a `select` between `ados.sim` and `gather` where you include all columns that you will eventually need. – coffeinjunky May 06 '18 at 19:49
  • I think I have figured it out: df3 <- ados.sim %>% gather(key, value, **2:4**) %>% group_by(key) %>% do(tidy(kruskal.test(x= .$value, g = .$muttype))) df3 – RmyjuloR May 06 '18 at 19:49
0

If you can use external libraries:

library(matrixTests)

col_kruskalwallis(ados.sim[,-1], ados.sim[,1])

         obs.tot obs.groups df statistic    pvalue
ados.tsc     240          3  2 0.3618102 0.8345146
ados.sa      240          3  2 0.1319554 0.9361517
ados.rrb     240          3  2 2.2205031 0.3294761
Karolis Koncevičius
  • 9,417
  • 9
  • 56
  • 89