2

I'm following very useful solution on creating a summary column for multiple categories. As discussed in the linked solution, I am working with a code that generates the percentage column for each subgroup.

Relevant sample code from the linked solution:

mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))

The code generates the desired values:

## Source: local data frame [4 x 4]
## Groups: am
## 
##   am gear  n rel.freq
## 1  0    3 15      79%
## 2  0    4  4      21%
## 3  1    4  8      62%
## 4  1    5  5      38%

Problem

I would like modify this code to dynamically created columns pertaining to the unique categories available in the second category passed in the dplyr call. This would be gear in case of the attached example. So in case of the attached example, the resulting data frame would look like that:

   am gear  n rel.freq_gear3 rel.freq_gear4  rel.freq_gear5
 1  0    3 15      79%            21%
 2  1    4  8      0              62%            38%

Attempts

For a small number of categories I presume that I would be able to make use of the summarising the values in conditionally, as discussed here, where I would try to execute dplyr statements only for specified conditions sumBfoo = sum(B[A=="foo"])). However, this approach would be inefficient when dealing with multiple categories. Outside dplyr solution could be developed with use of a loop and jumping through unique values of the desired category but my desire would be to do this in dplyr.

Sample table

Broadly speaking, I would like to create a table similar to the one below:

 library(gmodels)
 CrossTable(mtcars$am, mtcars$gear)


   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  32 


             | mtcars$gear 
   mtcars$am |         3 |         4 |         5 | Row Total | 
-------------|-----------|-----------|-----------|-----------|
           0 |        15 |         4 |         0 |        19 | 
             |     4.169 |     1.371 |     2.969 |           | 
             |     0.789 |     0.211 |     0.000 |     0.594 | 
             |     1.000 |     0.333 |     0.000 |           | 
             |     0.469 |     0.125 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|
           1 |         0 |         8 |         5 |        13 | 
             |     6.094 |     2.003 |     4.339 |           | 
             |     0.000 |     0.615 |     0.385 |     0.406 | 
             |     0.000 |     0.667 |     1.000 |           | 
             |     0.000 |     0.250 |     0.156 |           | 
-------------|-----------|-----------|-----------|-----------|
Column Total |        15 |        12 |         5 |        32 | 
             |     0.469 |     0.375 |     0.156 |           | 
-------------|-----------|-----------|-----------|-----------|

But I'm interested only in row proportions without counts and totals and other gadgets.

Community
  • 1
  • 1
Konrad
  • 17,740
  • 16
  • 106
  • 167
  • 1
    This gets you closer but it's not clear to me how you decide to remove some of the rows from the `n`-column. `library(tidyr); count(mtcars, am, gear) %>% mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%")) %>% spread(gear, rel.freq)` – talat Nov 30 '15 at 13:38
  • @docendodiscimus Thanks very much for showing the interest in my modest problem. I was also thinking of implementing a way of creating a matrix of sizes **category 1 x category 2** and then inserting values in each cell for a specific combination of groups. Rather onerous solution, I would hope that maybe there is way of doing that in a more efficient manner via `dplyr`. – Konrad Nov 30 '15 at 13:42
  • There is a similar solution [here](http://stackoverflow.com/questions/19500474/find-proportion-across-categories-grouped-by-a-second-category-using-ddply) using `plyr` that I just found. – Konrad Nov 30 '15 at 13:45
  • 1
    I still don't understand how you removed some of the information (from column "n").. perhaps someone else does. – talat Nov 30 '15 at 13:47
  • I agree, the `n` column in your expected output appears to be wrong. You are also missing a 0 (or NA) in the first row, last column. – Axeman Nov 30 '15 at 13:59

1 Answers1

1

dplyr

Building of the comment by @docendo discimus:

library(tidyr)
count(mtcars, am, gear) %>% 
  mutate(rel.freq = n/sum(n)) %>% 
  spread(gear, rel.freq) %>% 
  group_by(am) %>%
  summarize_each(funs(sum2 = sum(., na.rm = TRUE))) %>%
  mutate_each(funs(perc = paste0(round(100 * ., 0), "%")), -am, -n)

Produces:

Source: local data frame [2 x 5]

     am     n     3     4     5
  (dbl) (int) (chr) (chr) (chr)
1     0    19   79%   21%    0%
2     1    13    0%   62%   38%

base

prop.table(table(mtcars$am, mtcars$gear), 1) %>% 
  round(2) %>% 
  '*'(100)

Produces:

   3  4  5
0 79 21  0
1  0 62 38
Axeman
  • 32,068
  • 8
  • 81
  • 94
  • Thanks very much for your contribution, it's a neat solution. Can I ask, how could I alter the generated columns names. So for example instead of `3` the column name would correspond to `gear_3`? – Konrad Nov 30 '15 at 14:42
  • `names(output)[-(1:2)] <- paste("gear", names(output)[-(1:2)], sep = "_")` – Axeman Nov 30 '15 at 14:50
  • Fair enough, daft question on my side. Don't know why, I was thinking on renaming columns inside `dplyr` instead of making use of the `base`. – Konrad Nov 30 '15 at 14:52
  • There might be a way, but this seems easier. – Axeman Nov 30 '15 at 14:55