
What would be the best tool/package to use to calculate proportions by subgroups? I thought I could try something like this:

data(mtcars)
library(plyr)
ddply(mtcars, .(cyl), transform, Pct = gear/length(gear))

But the output is not what I want: I would like a result with one row per group rather than one row per car. Even if I change it to summarise, I still get the same problem.

I am open to other packages, but I thought plyr would be best as I would eventually like to build a function around this. Any ideas?

I'd appreciate any help just solving a basic problem like this.

vashts85
  • `prop.table(table(mtcars$cyl, mtcars$gear))`? – alistaire May 05 '16 at 18:33
  • This is certainly helpful, but I am hoping to get something in a dataframe format that I could eventually plug into ggplot. Can I do that with this? – vashts85 May 05 '16 at 18:34
  • Yep. If you wrap it in `data.frame`, it shifts it to long format, which is probably what you'll need for ggplot anyway. You do lose variable names, which is unfortunate, but that's fixable. – alistaire May 05 '16 at 18:37
  • Is there a way to do this with `plyr` though? I am really trying to learn it, and coming up short on resources. – vashts85 May 05 '16 at 18:39
  • If you're learning, you should learn `dplyr`, which is the successor to `plyr`. You could write equivalent code with `library(dplyr) ; mtcars %>% group_by(cyl, gear) %>% summarise(Freq = n()/nrow(mtcars))` – alistaire May 05 '16 at 18:43
  • @alistaire that is really close to what I want, but it gives me the % for the total dataframe, and not the percentage within each level of `cyl`. How would I go about getting that? – vashts85 May 05 '16 at 18:51

2 Answers

library(dplyr)

mtcars %>%
  count(cyl, gear) %>%
  mutate(prop = prop.table(n))

See ?count. Basically, count() is a wrapper for summarise() with n() that does the group_by() for you. Look at the output of just mtcars %>% count(cyl, gear). Then we use mutate() to add a variable named prop, which is the result of calling prop.table() on the n column that count(cyl, gear) created.
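In recent dplyr releases, count() returns an ungrouped result when its input is ungrouped, so the prop.table(n) call above would give each combination's share of the whole data frame rather than its share within a cyl level. A minimal sketch that makes the grouping explicit, assuming a current dplyr (the name props is only for illustration):

```r
library(dplyr)

# Count each cyl/gear combination, then regroup by cyl so that
# proportions are computed within each cylinder level.
props <- mtcars %>%
  count(cyl, gear) %>%
  group_by(cyl) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

props
```

With the explicit group_by(cyl), the prop values within each cyl level should sum to 1 regardless of how count() handles grouping in the installed version.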

You could create this as a function using the SE versions of count(), that is count_(). Look at the vignette for Non-Standard Evaluation in the dplyr package.
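The underscore verbs such as count_() have since been deprecated in dplyr. A rough sketch of the same idea for newer versions, assuming dplyr 1.0 or later and column names passed as strings; prop_by_group and its argument names are hypothetical, not part of dplyr:

```r
library(dplyr)

# Hypothetical helper: group_var and prop_var are column names given as strings.
prop_by_group <- function(data, group_var, prop_var) {
  data %>%
    # Count rows for every combination of the two columns.
    group_by(across(all_of(c(group_var, prop_var)))) %>%
    summarise(n = n(), .groups = "drop") %>%
    # Regroup by the grouping column only, so proportions sum to 1 within it.
    group_by(across(all_of(group_var))) %>%
    mutate(prop = n / sum(n)) %>%
    ungroup()
}

prop_by_group(mtcars, "cyl", "gear")
```

Taking strings keeps the call simple; accepting unquoted column names would need the tidy-eval tools described in dplyr's programming vignette.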

Here's a nice GitHub gist addressing lots of cross-tabulation variants with dplyr and other packages.

JasonAizkalns
  • This is great, but can you explain what exactly it's doing? And, would it be possible to embed this into a function so that `cyl` and `gear` can be switched out as values? – vashts85 May 05 '16 at 18:54
  • Can you help me figure out why this won't work then: `my_func = function() { mtcars %>% count_(cyl, gear) %>% mutate_(prop = prop.table(n)) } ; my_func()` – vashts85 May 05 '16 at 19:03

To get frequency within a group:

library(dplyr)
mtcars %>% count(cyl, gear) %>% mutate(Freq = n/sum(n))
# Source: local data frame [8 x 4]
# Groups: cyl [3]
# 
#     cyl  gear     n       Freq
#   (dbl) (dbl) (int)      (dbl)
# 1     4     3     1 0.09090909
# 2     4     4     8 0.72727273
# 3     4     5     2 0.18181818
# 4     6     3     2 0.28571429
# 5     6     4     4 0.57142857
# 6     6     5     1 0.14285714
# 7     8     3    12 0.85714286
# 8     8     5     2 0.14285714

or equivalently,

mtcars %>% group_by(cyl, gear) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))

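Be careful about what the grouping is at each stage, or your numbers will be off. In the second chain, summarise() drops the last grouping variable (gear), so the mutate() that follows runs within each cyl.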
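One way to sanity-check the grouping is to compare against base R, where the margin argument of prop.table() states the grouping explicitly. This is only a cross-check sketch; the names tab and within_cyl are illustrative:

```r
# Cross-tabulate cyl by gear, then take proportions within each row (cyl level).
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
within_cyl <- as.data.frame(prop.table(tab, margin = 1))

# Each cyl level's proportions should sum to 1.
tapply(within_cyl$Freq, within_cyl$cyl, sum)
```

Unlike the dplyr output, this data frame stores cyl and gear as factors and keeps combinations that never occur (8 cylinders with 4 gears) with a Freq of 0.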

alistaire
  • Thank you @alistaire. But now, how would I place this within a function? It seems that `dplyr` does not play well with functions, and I cannot get around this. I've tried the following, but it fails with `Error: unknown column 'x'`: `my_func = function(x, y) { mtcars %>% group_by_(quote(x), quote(y)) %>% summarise_(Freq = n()) %>% mutate_(Freq = Freq/sum(Freq)) } ; my_func(gear, wt)` – vashts85 May 05 '16 at 19:07
  • What do you want the function to do? If you chuck either of the above chains into a function, it'll be fine. When you want to generalize the function so you can pass it variable names, you'll need to mix in standard-eval forms; see [Hadley's NSE vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html). Really, start with the normal versions, though; they can take you a long way before you have to worry about the SE forms. – alistaire May 05 '16 at 19:11
  • I want to be able to give a function two parameters, a grouping variable and a proportion variable, and have it produce proportions by that grouping variable such that the data is easily plottable. I am not sure what you mean by starting with normal versions, but am happy to try them. I couldn't get a function to work using any of the commands posted here. – vashts85 May 05 '16 at 19:14
  • The normal versions are the non-standard eval ones that don't end in a `_`; the SE ones end in `_`. The easiest way to get the function you describe to work is `my_func <- function(x, y){mtcars %>% group_by_(x, y) %>% summarise(Freq = n()) %>% mutate(Freq = Freq/sum(Freq))} ; my_func('gear', 'wt')`. Note you only use SE forms when you are passing them strings. (And pass them strings, not unquoted variable names, which takes more work yet.) – alistaire May 05 '16 at 19:17
  • This is supremely helpful. I unfortunately don't understand the differences between the `SE` and `NSE` concepts you are describing, but I think this works well for me! – vashts85 May 05 '16 at 19:23