18

I am trying to create one table that summarizes several categorical variables (using frequencies and proportions) by another variable. I would like to do this using the dplyr package.

These previous Stack Overflow discussions have partially what I am looking for: Relative frequencies / proportions with dplyr and Calculate relative frequency for a certain group.

Using the mtcars dataset, this is what the output would look like if I just wanted to look at the proportion of gear by am category:

    mtcars %>%
    group_by(am, gear) %>%
    summarise (n = n()) %>%
    mutate(freq = n / sum(n))

    #   am gear  n      freq
    # 1  0    3 15 0.7894737
    # 2  0    4  4 0.2105263
    # 3  1    4  8 0.6153846
    # 4  1    5  5 0.3846154

However, I actually want to look at not only the gears by am, but also carb by am and cyl by am, separately, in the same table. If I amend the code to:

    mtcars %>%
    group_by (am, gear, carb, cyl) %>%
    summarise (n = n()) %>%
    mutate(freq = n / sum(n))

I get the frequencies for each combination of am, gear, carb, and cyl. Which is not what I want. Is there any way to do this with dplyr?

EDIT

Also, it would be an added bonus if anyone knew of a way to produce the table I want, but with the categories of am as the columns (as in a classic 2x2 table format). Here is an example of what i'm referring to. It is from one of my previous publications. I want to produce this table in R, so that I can output it directly to a word document using RMarkdown:

enter image description here

Community
  • 1
  • 1
RNB
  • 457
  • 1
  • 5
  • 14
  • Is there a reason it has to be done in dplyr? And is one of the 'groups' always the same? (Here it's `am`) – Heroka Jan 04 '16 at 08:48
  • And can you give an example of the table you want? It's certainly possible with some reshaping, but I'm not sure what you're after. – Heroka Jan 04 '16 at 08:57
  • There is no super important reason it has to be done in dplyr, except that i'm trying to learn the package very well, so that I have a consistent method available for producing my tables. Another driving reason for dplyr is that it produces a data frame as output, which allows me to use the stargazer package for producing publication worthy tables that I can then output to a word document using RMarkdown. I am, of course, open to alternative methods that others think are better for doing this. – RNB Jan 05 '16 at 06:51
  • I am having a similar problem. How did you construct your table in the end @RNB? – Frederick Mar 16 '18 at 12:08

3 Answers3

12

One way to solve this, is to turn your data to a long(er) format. You can then use the same code to calculate the outcomes you want, with one extra group_by:

library(reshape2)
library(dplyr)

m_mtcars <- melt(mtcars,measure.vars=c("gear","carb","cyl"))

res <- m_mtcars %>%
  group_by(am, variable, value) %>%
  summarise (n = n()) %>%
  mutate(freq = n / sum(n))

Building on this, the desired output can be obtained using more reshaping and some string formatting

#make an 'export' variable
res$export <- with(res, sprintf("%i (%.1f%%)", n, freq*100))

#reshape again
output <- dcast(variable+value~am, value.var="export", data=res, fill="missing") #use drop=F to prevent silent missings 
#'silent missings'
output$variable <- as.character(output$variable)
#make 'empty lines' 
empties <- data.frame(variable=unique(output$variable), stringsAsFactors=F)
empties[,colnames(output)[-1]] <- ""

#bind them together
output2 <- rbind(empties,output)
output2 <- output2[order(output2$variable,output2$value),]

#optional: 'remove' variable if value present

output2$variable[output2$value!=""] <- ""

This results in:

   variable value          0         1
2      carb                           
7               1  3 (15.8%) 4 (30.8%)
8               2  6 (31.6%) 4 (30.8%)
9               3  3 (15.8%)   missing
10              4  7 (36.8%) 3 (23.1%)
11              6    missing  1 (7.7%)
12              8    missing  1 (7.7%)
3       cyl                           
13              4  3 (15.8%) 8 (61.5%)
14              6  4 (21.1%) 3 (23.1%)
15              8 12 (63.2%) 2 (15.4%)
1      gear                           
4               3 15 (78.9%)   missing
5               4  4 (21.1%) 8 (61.5%)
6               5    missing 5 (38.5%)
Heroka
  • 12,889
  • 1
  • 28
  • 38
  • 1
    If you are using `dplyr` one another option is `gather` from `tidyr` and connect with `%>%` – akrun Jan 04 '16 at 09:39
7

With tidyr/dplyr combination, here is how you would do it:

library(tidyr)
library(dplyr)

mtcars %>%
  gather(variable, value, gear, carb, cyl) %>%
  group_by(am, variable, value) %>%
  summarise (n = n()) %>%
  mutate(freq = n / sum(n))
Gopala
  • 10,363
  • 7
  • 45
  • 77
  • 1
    From the `gather()` help: "Development on `gather()` is complete, and for new code we recommend switching to `pivot_longer()`, which is easier to use, more featureful, and still under active development. `df %>% gather("key", "value", x, y, z)` is equivalent to `df %>% pivot_longer(c(x, y, z), names_to = "key", values_to = "value")`. " – Brian D Jan 17 '20 at 17:18
1

An alternative to group by and then summarize is to use a count().

This just makes the code 1 line more concise

library(reshape2)
library(dplyr)

m_mtcars <- melt(mtcars,measure.vars=c("gear","carb","cyl"))

res <- m_mtcars %>%
  count(am, variable, value) %>%
  mutate(freq = n / sum(n))

The other benefit is that this will save the other values that are lost in a group_by summarize. The resulting table looks like this

enter image description here

chrisleboa
  • 21
  • 1