4

Using dplyr I'm generating a simple summary table for two categories:

# Data
data("mtcars")
# Lib
require(dplyr)
# Summary
mt_sum <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  spread(key = am, value = n)

Which produces the desired results:

Source: local data frame [3 x 3]

   gear     0     1
  (dbl) (int) (int)
1     3    15    NA
2     4     4     8
3     5    NA     5

To the generated table I would like to add a set of columns that would have row percentages instead of the presently available totals.

Desired results

I would like for my table to look like that:

   gear     0     1   0per   1per
1     3    15    NA   100%   
2     4     4     8   33%    67%    
3     5    NA     5          100%

Attempts

I tried to achieve the following by adding the code:

mt_sum <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  spread(key = am, value = n) %>%
  mutate_each(funs(./rowSums(.)))

but it returns the following error:

Error: 'x' must be an array of at least two dimensions

Hence my question: how can I add extra columns with row percentage values in dplyr?

Side points

  • I would prefer blank values instead of NAs
  • The table could be easily build with use of CrossTable in gmodels but I would like to stay in dplyr as I want to keep as many transformations as possible in one place
Konrad
  • 17,740
  • 16
  • 106
  • 167
  • When you say "blank values" do you mean zero? because then adding fill=0 to spread() fixes that. fill=" " produces blanks but then the columns are characters. – atiretoo Dec 03 '15 at 15:33
  • @atiretoo Thank you for showing the interest. I should be more precise, `0` will do. Also for the export purposes an empty string would be OK. But I'm not too fussy about this, it's only a side point. – Konrad Dec 03 '15 at 15:39

3 Answers3

4

I think this is what you need:

# Data
data("mtcars")
# Lib
require(dplyr)
require(tidyr)
require(scales) #for percent
# Summary
mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  spread(key = am, value = n) %>%
  #you need rowwise because this is a rowwise operation
  rowwise %>%
  #I find do to be the best function for ad-hoc things that 
  #have no specific dplyr function
  #I use do below to calculate the numeric percentages
  do(data.frame(.,
                per0 = .$`0` / sum(.$`0`, .$`1`, na.rm=TRUE),
                per1 = .$`1` / sum(.$`0`, .$`1`, na.rm=TRUE))) %>%
  #mutate here is used to convert NAs to blank and numbers to percentages
  mutate(per0 = ifelse(is.na(per0), '', percent(per0)),
         per1 = ifelse(is.na(per1), '', percent(per1)))

Output:

Source: local data frame [3 x 5]
Groups: <by row>

   gear    X0    X1  per0  per1
  (dbl) (int) (int) (chr) (chr)
1     3    15    NA  100%      
2     4     4     8 33.3% 66.7%
3     5    NA     5        100%
LyzandeR
  • 37,047
  • 12
  • 77
  • 87
  • Thanks for the brilliant solution, would there be a way to create `per0` and `per1` dynamically? I would like to be able to apply this solution to a `data.frame` where I may have a number of groups? – Konrad Dec 03 '15 at 15:58
  • Thanks @Konrad, I m happy I could be of help. I ll redirect you to [this](http://stackoverflow.com/questions/26665319/removing-na-in-dplyr-pipe). You should probably use `na.omit` so that the number of rows stays the same. – LyzandeR Dec 03 '15 at 15:59
  • Thanks very much, I cam across that post when looking for the `NA` solution. – Konrad Dec 03 '15 at 16:00
  • @Konrad I suspect you could with an `lapply` within `do` and `mutate_each` on the next step. Lots of things change in dplyr if you want to do a multiple column operation without specifying the names. Most likely you would have to write some of the column names down anyway. – LyzandeR Dec 03 '15 at 16:00
  • 1
    Thanks very much, I will have a look. – Konrad Dec 03 '15 at 16:02
4

Here is a way to do it with reshaping:

library(dplyr) library(tidyr)

mtcars %>%
  count(gear, am) %>%
  mutate(percent = n / sum(n)) %>%
  gather(variable, value, 
         n, percent) %>%
  unite("new_variable", am, variable) %>%
  spread(new_variable, value)
bramtayl
  • 4,004
  • 2
  • 11
  • 18
3

So this gets part way there, but doesn't do it all in a single expression and doesn't rename the variables. @LyzandeR 's solution is better.

library(tidyr)
library(dplyr)
mt_sum <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  spread(key = am, value = n, fill=0) 
row_sum <- rowSums(mt_sum[,2:3])
mt_sum <- mutate_each(mt_sum[,2:3],funs(./row_sum)) %>% bind_cols(mt_sum)
atiretoo
  • 1,812
  • 19
  • 33