Preserve order of input variables and factor levels in summary table, using dplyr tidyr

Question

I love how easy dplyr and tidyr have made it to create a single summary table with multiple predictor and outcome variables. One thing that got me stumped was the final step of preserving/defining the order of the predictor variables, and their factor levels, in the output table.

I've come up with a solution of sorts (below), which involves using mutate to manually make a factor variable that combines both the predictor and predictor value (eg. "gender_female") with levels in the desired output order. But my solution is a bit long winded if there are many variables, and I wonder if there is a better way?

library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

set.seed(1234)

dat <- data.frame(
  gender    = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1  = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2  = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  # Statement below creates variable for ordering output
  mutate(
    pred_ord = factor(interaction(predictor, addNA(pred_value), sep = "_"),
                      levels = c(paste("gender", levels(addNA(dat$gender)), sep = "_"),
                                 paste("ethnicity", levels(addNA(dat$ethnicity)), sep = "_")))
  ) %>%
  group_by(pred_ord, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(key = outcome, value = n) %>%
  separate(pred_ord, c("Predictor", "Pred_value"))

Source: local data frame [9 x 4]

  Predictor Pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1    gender     Female       25       27
2    gender       Male       11       10
3    gender    Unknown       12       15
4 ethnicity      Maori       10        9
5 ethnicity    Pacific        7        7
6 ethnicity      Asian        6       12
7 ethnicity      Other       10        9
8 ethnicity   European        5        4
9 ethnicity    Unknown       10       11
Warning message:
attributes are not identical across measure variables; they will be dropped

The table above is correct in that neither the Predictor nor Predictor values are resorted alphabetically.

EDIT

As requested, this is what is produced if the default ordering (alphabetical) is used. It makes sense in that when the factors are combined they are converted to a character variable and all attributes are dropped.

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n)

Source: local data frame [9 x 4]

  predictor pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1 ethnicity      Asian        6       12
2 ethnicity   European        5        4
3 ethnicity      Maori       10        9
4 ethnicity      Other       10        9
5 ethnicity    Pacific        7        7
6 ethnicity    Unknown       10       11
7    gender     Female       25       27
8    gender       Male       11       10
9    gender    Unknown       12       15
Warning message:
attributes are not identical across measure variables; they will be dropped

You may also want to look at forcats which provides a number of helpers for working with factors. — hadley, Aug 26 '16 at 18:48

alistaire · Accepted Answer · 2016-08-26T02:58:07.033

If you want your data to be factors arranged as such, you'll need to convert them back to factors, as gather coerces to character (which it warns you about). You can use gather's factor_key parameter to take care of predictor, but you'll need to assemble levels for pred_value as it now combines two factors from the original. Simplifying a bit:

library(tidyr)
library(dplyr)

dat %>% 
    gather(key = predictor, value = pred_value, gender, ethnicity, factor_key = TRUE) %>%
    group_by(predictor, pred_value) %>% 
    summarise_all(sum) %>%
    ungroup() %>% 
    mutate(pred_value = factor(pred_value, levels = unique(c(levels_eth, levels_gnd), 
                                                           fromLast = TRUE))) %>% 
    arrange(predictor, pred_value)

## # A tibble: 9 × 4
##   predictor pred_value outcome1 outcome2
##      <fctr>     <fctr>    <int>    <int>
## 1    gender     Female       25       27
## 2    gender       Male       11       10
## 3    gender    Unknown       12       15
## 4 ethnicity      Maori       10        9
## 5 ethnicity    Pacific        7        7
## 6 ethnicity      Asian        6       12
## 7 ethnicity      Other       10        9
## 8 ethnicity   European        5        4
## 9 ethnicity    Unknown       10       11

Note that you'll need to use unique with fromLast = TRUE to arrange the duplicate "Unknown" values into a single occurrence in the right place; union will put it earlier.

Hack-R · Answer 2 · 2016-08-26T03:11:32.167

5

You can do this in a much more concise and efficient way without special packages:

rbind(aggregate(dat[,colnames(dat) %in% c("outcome1", "outcome2")], 
                by = list(dat$gender), sum),
      aggregate(dat[,colnames(dat) %in% c("outcome1", "outcome2")], 
                by = list(dat$ethnicity), sum))

It aggregates the multiple predictors and outcome variables in a simple and direct way, and also avoids having to create that variable that was part of the complicated solution you mentioned.

   Group.1 outcome1 outcome2
1   Female       25       27
2     Male       11       10
3  Unknown       12       15
4    Maori       10        9
5  Pacific        7        7
6    Asian        6       12
7    Other       10        9
8 European        5        4
9  Unknown       10       11

If you want to rename the columns above just assign it to an object (e.g. mytable <-) and rename them (i.e. colnames(mytable) <- c("Pred_value", "outcome1", "outcome2")). You could also scale it up with an apply if there are too many variables to type.

edited Aug 26 '16 at 03:11

answered Aug 26 '16 at 02:15

Hack-R

22,422
14
75
131

2

Thanks, great approach. I need to brush up on my base skills! Suggested an edit to suit my needs. – JWilliman Aug 26 '16 at 02:46
@JWilliman Happy to help. This is scalable to any number of independent and dependent variables. Cheers. – Hack-R Aug 26 '16 at 02:47
It asked for a dplyr approach, anyway this one is good as long as nobody post a good dplyr approach :-) – akrun Aug 26 '16 at 02:56
@alistaire has a `dplyr` answer that looks good. Of course, I could just put `library(dplyr)` at the beginning of my code block and not call any of its functions to make this a `dplyr` answer, right? ;-) – Hack-R Aug 26 '16 at 03:01
I must admit I `dplyr` in the title because that is what I was coding with at the time (and wanted to see if the question had been asked before). But I'll remember to keep my options more open in the future. Honestly, sometimes I feel R has more accents than English, and I don't understand the Queen! – JWilliman Aug 26 '16 at 03:25

score 0 · Answer 3 · answered Oct 15 '20 at 14:43

You can prefix your variables with values that force them into the right order, such as "X1_gender", "X2_ethnicity". The prefixes can be stripped with a mutate at the end. This is probably not a "tidy" solution, but it worked for my purposes on the problem that led me to this post.

library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

set.seed(1234)

dat <- data.frame(
  X1_gender    = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  X2_ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1  = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2  = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, X1_gender, X2_ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n) %>%
  mutate(predictor=gsub("^X[0-9]_","", predictor))

Result:

`summarise()` regrouping output by 'predictor', 'pred_value' (override with 
`.groups` argument)
# A tibble: 9 x 4
# Groups:   predictor, pred_value [9]
  predictor pred_value outcome1 outcome2
  <chr>     <chr>         <int>    <int>
1 gender    Female           16       21
2 gender    Male             12       15
3 gender    Unknown          18       16
4 ethnicity Asian             4        6
5 ethnicity European         13       13
6 ethnicity Maori             4        6
7 ethnicity Other             7       11
8 ethnicity Pacific          10        9
9 ethnicity Unknown           8        7
Warning message:
attributes are not identical across measure variables;
they will be dropped

Preserve order of input variables and factor levels in summary table, using dplyr tidyr

3 Answers3

Linked