Tidyverse solution to grouping, and looping through dataframe to perform dunn.test

Question

I have an example dataframe below.

example.df <- data.frame( 
species = sample(c("primate", "non-primate"), 50, replace = TRUE),
treated = sample(c("Yes", "No"), 50, replace = TRUE), 
gender = sample(c("male", "female"), 50, replace = TRUE), 
var1 = rnorm(50, 100, 5), resp=rnorm(50, 10,5), value1 = rnorm (50, 25, 5))

I want to group by treated first, and then loop over all numeric variables in the data to perform a dunn test (pairw.kw from the asbio package) using species as the explanatory variable, extract the summary dataframe object and bind the columns from the yes and no sub-lists into a new dataframe object.

I have already obtained a partial solution here and here using a partially "tidy" approach which works quite well. I am just looking for a more elegant tidyverse solution in order to help me learn to be a better R user.

Any help is appreciated.

Edit: This is the output I have from the code in the partially "tidy" solution.

structure(list(var1.Diff = structure(1:2, .Label = c("-7.05229", 
"-2.25"), class = "factor"), var1.Lower = structure(1:2, .Label = 
c("-13.23198", 
"-8.25114"), class = "factor"), var1.Upper = structure(1:2, .Label = 
c("-0.87259", 
"3.75114"), class = "factor"), var1.Decision = structure(1:2, .Label = 
 c("Reject H0", 
 "FTR H0"), class = "factor"), var1.Adj..P.value = structure(1:2, .Label = 
 c("0.025305", 
 "0.462433"), class = "factor"), resp.Diff = structure(1:2, .Label = 
 c("1.10458", 
 "0"), class = "factor"), resp.Lower = structure(1:2, .Label = c("-5.07512", 
 "-6.00114"), class = "factor"), resp.Upper = structure(1:2, .Label = 
 c("7.28427", 
 "6.00114"), class = "factor"), resp.Decision = structure(c(1L, 
 1L), .Label = "FTR H0", class = "factor"), resp.Adj..P.value = 
 structure(1:2, .Label = c("0.726092", 
 "1"), class = "factor"), effect.Diff = structure(1:2, .Label = 
 c("-1.27451", 
 "-0.5625"), class = "factor"), effect.Lower = structure(1:2, .Label = 
 c("-7.4542", 
 "-6.56364"), class = "factor"), effect.Upper = structure(1:2, .Label = 
 c("4.90518", 
 "5.43864"), class = "factor"), effect.Decision = structure(c(1L, 
 1L), .Label = "FTR H0", class = "factor"), effect.Adj..P.value = 
 structure(1:2, .Label = c("0.686047", 
 "0.85424"), class = "factor")), row.names = c("No", "Yes"), class = 
 "data.frame")

Well the `broom` package is great because it turns the results of statistical tests into data frames that you can then manipulate, but I know for a fact that it doesn't work on the output of the dunn test. — gatsky, Mar 04 '19 at 10:25
Yes. I can second that the `broom` package does not cover the dunn test results. — jaydoc, Mar 04 '19 at 10:35

dipetkov · Accepted Answer · 2022-03-16T11:20:54.800

Update 2022/03/16

The tidyverse has evolved and so should this solution.

library("asbio")
#> Loading required package: tcltk
library("tidyverse")

# It is good practice to set the seed before simulating random data
# Otherwise we can't compare
set.seed(12345)

example.df <-
  tibble(
    species = sample(c("primate", "non-primate"), 50, replace = TRUE),
    treated = sample(c("Yes", "No"), 50, replace = TRUE),
    gender = sample(c("male", "female"), 50, replace = TRUE),
    var1 = rnorm(50, 100, 5), resp = rnorm(50, 10, 5), value1 = rnorm(50, 25, 5)
  ) %>%
  # Make sure species is a factor as required by `pair.kw`
  mutate(
    species = factor(species)
  )

example.df %>%
  # We want to perform the test separately for each group
  group_by(
    treated
  ) %>%
  group_modify(
    # Perform test and extract summary
    ~ pairw.kw(.x$var1, .x$species, conf = 0.95)$summary
  )
#> # A tibble: 2 × 6
#> # Groups:   treated [2]
#>   treated Diff     Lower    Upper    Decision `Adj. P-value`
#>   <chr>   <chr>    <chr>    <chr>    <chr>    <chr>         
#> 1 No      -2.67949 -8.12299 2.76402  FTR H0   0.334663      
#> 2 Yes     4.07071  -2.98047 11.12188 FTR H0   0.257843

It is straightforward to extend this approach to run 6 tests, one for each combination of a treatment group and a response variable (var1, value1 or resp). For example we can convert the tibble from wide format (three response columns) to narrow format (three responses stacked row-wise) and then proceed pretty much as above.

responses <- c("value1", "var1", "resp")

example.df %>%
  pivot_longer(
    all_of(responses),
    names_to = "variable"
  ) %>%
  group_by(
    treated, variable
  ) %>%
  group_modify(
    ~ pairw.kw(.x$value, .x$species, conf = 0.95)$summary
  )
#> # A tibble: 6 × 7
#> # Groups:   treated, variable [6]
#>   treated variable Diff     Lower     Upper    Decision `Adj. P-value`
#>   <chr>   <chr>    <chr>    <chr>     <chr>    <chr>    <chr>         
#> 1 No      resp     -1.46154 -6.90505  3.98197  FTR H0   0.598725      
#> 2 No      value1   4.62821  -0.8153   10.07171 FTR H0   0.095632      
#> 3 No      var1     -2.67949 -8.12299  2.76402  FTR H0   0.334663      
#> 4 Yes     resp     3.75758  -3.2936   10.80875 FTR H0   0.29627       
#> 5 Yes     value1   -2.97475 -10.02592 4.07643  FTR H0   0.408311      
#> 6 Yes     var1     4.07071  -2.98047  11.12188 FTR H0   0.257843

^{Created on 2022-03-16 by the reprex package (v2.0.1)}

Old solution

Here is a tidy approach to running multiple tests simultaneously.

library("asbio")
#> Loading required package: tcltk
library("tidyverse")

# It is good practice to set the seed before simulating random data
# Otherwise can't compare
set.seed(12345)

example.df <- tibble(
  species = sample(c("primate", "non-primate"), 50, replace = TRUE),
  treated = sample(c("Yes", "No"), 50, replace = TRUE),
  gender = sample(c("male", "female"), 50, replace = TRUE),
  var1 = rnorm(50, 100, 5), resp=rnorm(50, 10,5), value1 = rnorm (50, 25, 5)) %>%
  # Make sure species is a factor as required by `pair.kw`
  mutate(species = factor(species))

example.df %>%
  # Nest the data for each treatment group
  nest(- treated) %>%
  # Run the test on each treatment group
  mutate(test = map(data, ~ pairw.kw(.$var1, .$species, conf = 0.95))) %>%
  # There is no broom::tidy method for objects of class pairw
  # but we can extract the summary ourselves
  mutate(summary = map(test, pluck, "summary")) %>%
  # Cast all the factor columns in the summary table to character, to
  # avoid a warning about converting factors to characters.
  mutate(summary = map(summary, mutate_if, is.factor, as.character)) %>%
  unnest(summary)
#> # A tibble: 2 x 8
#>   treated data        test     Diff    Lower  Upper Decision `Adj. P-value`
#>   <chr>   <list>      <list>   <chr>   <chr>  <chr> <chr>    <chr>         
#> 1 No      <tibble [2~ <S3: pa~ -1.705~ -7.99~ 4.58~ FTR H0   0.595163      
#> 2 Yes     <tibble [2~ <S3: pa~ -1.145~ -6.45~ 4.16~ FTR H0   0.672655

It is straightforward to extend this approach to run 6 tests, one for each combination of a treatment group and a response variable (var1, value1 or resp). For example we can convert the tibble from wide format (three response columns) to narrow format (three responses stacked rowwise) and then proceed pretty much as above.

example.df %>%
  # From wide format to narrow format
  gather(varname, y, value1, var1, resp) %>%
  # Nest the data for each treatment group and each variable
  nest(- treated, - varname) %>%
  # Run 6 tests = (number of treatments) * (number of response variables)
  mutate(test = map(data, ~ pairw.kw(.$y, .$species, conf = 0.95))) %>%
  # There is no broom::tidy method for objects of class pairw
  # but we can extract the summary ourselves
  mutate(summary = map(test, pluck, "summary")) %>%
  # Cast all the factor columns in the summary table to character, to
  # avoid a warning about converting factors to characters.
  mutate(summary = map(summary, mutate_if, is.factor, as.character)) %>%
  unnest(summary)
#> # A tibble: 6 x 9
#>   treated varname data    test   Diff   Lower Upper Decision `Adj. P-value`
#>   <chr>   <chr>   <list>  <list> <chr>  <chr> <chr> <chr>    <chr>         
#> 1 No      value1  <tibbl~ <S3: ~ 3.127~ -3.1~ 9.41~ FTR H0   0.329969      
#> 2 Yes     value1  <tibbl~ <S3: ~ -1.33~ -6.65 3.97~ FTR H0   0.622065      
#> 3 No      var1    <tibbl~ <S3: ~ -1.70~ -7.9~ 4.58~ FTR H0   0.595163      
#> 4 Yes     var1    <tibbl~ <S3: ~ -1.14~ -6.4~ 4.16~ FTR H0   0.672655      
#> 5 No      resp    <tibbl~ <S3: ~ 1.421~ -4.8~ 7.71~ FTR H0   0.657905      
#> 6 Yes     resp    <tibbl~ <S3: ~ 1.145~ -4.1~ 6.45~ FTR H0   0.672655

^{Created on 2019-03-04 by the reprex package (v0.2.1)}

And what if you want to be flexible about the number and the name of responses? Then specify a list of responses:

responses <- c("value1", "var1", "resp")

And use tidy evaluation in the gather statement as follows:

example.df %>%
  # From wide format to narrow format, with tidy evaluation
  gather(varname, y, !!!rlang::syms(responses))
  # Rest of computation here.....

Thanks. That solves the grouping and extracting the data to a tibble parts of my question, but how about looping over all the numeric variables using tidy methods? — jaydoc, Mar 04 '19 at 10:59
You can add more nesting variables. It is actually not entirely clear to me what you mean by looping over all the numeric variables. — dipetkov, Mar 04 '19 at 12:09
I was referring to whether it is possible to modify the code to run the dunn test for `resp` and `value1` in addition to `var1` in this example using just a single execution of the code. — jaydoc, Mar 04 '19 at 12:17
Just to make sure I understand -- run three tests `pairw.kw(var1, species)` and `pairw.kw(value1, species)` and `pairw.kw(resp, species)`? Report results from three tests at the end? In three rows or in three columns? — dipetkov, Mar 04 '19 at 13:16
Yes. I was thinking that it would be 6 rows (3 variables x 2 per variable as there are two groups, "yes" and "no" in this dataset). — jaydoc, Mar 04 '19 at 13:26
I have edited the question to add `dput(res.df)` results as an example of the dataframe I am trying to get using the `tidy` approach. — jaydoc, Mar 04 '19 at 13:39
Can you check, ideally without any tidy manipulations, that the p-values are computed correctly? I just realized that `example.df` is random and we didn't set the random seed, so it is difficult to compare results. — dipetkov, Mar 04 '19 at 14:15
That is so neat! The values are the same as what they are without tidy manipulations. I came across another example converting to long format around here, but could not grasp how that would help. I could try to figure this out myself, but my actual data has more than 50 variables and I am trying to avoid typing them in one after the other in the `gather` step. Any suggestions? Thanks for your rapid responses. You have been most helpful in explaining `tidyverse` mechanics. — jaydoc, Mar 04 '19 at 14:31
You can use tidy evaluation for that, I've added an example at the end. As an aside, I learned about this general workflow from the "Many Models" section of the R for Data Science book https://r4ds.had.co.nz/. — dipetkov, Mar 04 '19 at 14:55

Tidyverse solution to grouping, and looping through dataframe to perform dunn.test

1 Answers1