3

Consider this

var1<- c('A','B','C','C','G')
var2<- c('G','B','B','F','A')
df = df=data.frame(var1,var2)

> df
  var1 var2
1    A    G
2    B    B
3    C    B
4    C    F
5    G    A

I want to get a two-way table between var1 and var2.

I can use

> df %>% count(var1,var2) %>% spread(var2, n, fill = 0)
Source: local data frame [4 x 5]
Groups: var1 [4]

    var1     A     B     F     G
* <fctr> <dbl> <dbl> <dbl> <dbl>
1      A     0     0     0     1
2      B     0     1     0     0
3      C     0     1     1     0
4      G     1     0     0     0

or using table

> table( df$var1, df$var2)

    A B F G
  A 0 0 0 1
  B 0 1 0 0
  C 0 1 1 0
  G 1 0 0 0

Both give the same results. But now suppose I am interested in the row or columns proportions. I can use table again here

> prop.table(table( df$var1, df$var2), margin =2) # for column

      A   B   F   G
  A 0.0 0.0 0.0 1.0
  B 0.0 0.5 0.0 0.0
  C 0.0 0.5 1.0 0.0
  G 1.0 0.0 0.0 0.0

> prop.table(table( df$var1, df$var2), margin =1) # for rows

      A   B   F   G
  A 0.0 0.0 0.0 1.0
  B 0.0 1.0 0.0 0.0
  C 0.0 0.5 0.5 0.0
  G 1.0 0.0 0.0 0.0

How can I reproduce these with dplyr (the shortest syntax)?

The solution below seems to work (only in one case), although I don't understand why: what does the prop.table(n) function does here?

> df %>%  count(var1,var2) %>% mutate(prop = prop.table(n)) %>% spread(var2, prop, fill = 0)
Source: local data frame [4 x 6]
Groups: var1 [4]

    var1     n     A     B     F     G
* <fctr> <int> <dbl> <dbl> <dbl> <dbl>
1      A     1     0   0.0   0.0     1
2      B     1     0   1.0   0.0     0
3      C     1     0   0.5   0.5     0
4      G     1     1   0.0   0.0     0

Also, using margin in prop.table fails here.

> df %>%  count(var1,var2) %>% mutate(prop = prop.table(n ,margin =1)) %>% spread(var2, prop, fill = 0)
Error in eval(substitute(expr), envir, enclos) : 'x' is not an array

Thanks!

zx8754
  • 52,746
  • 12
  • 114
  • 209
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
  • 3
    Perhaps see [here](http://stackoverflow.com/questions/36033449/dplyr-pipe-how-to-add-a-margin-row-calculating-a-total-like-addmargins-functio), [here](http://stackoverflow.com/questions/31164350/dplyr-summarize-with-subtotals) and [here](http://stackoverflow.com/questions/26328336/using-dplyr-to-add-summary-rows). – Axeman Oct 26 '16 at 15:22
  • 2
    `df %>% janitor::crosstab(var1, var2, percent = 'row')` – alistaire Oct 26 '16 at 15:28
  • @alistaire crazy package! is it robust and trusted? – ℕʘʘḆḽḘ Oct 26 '16 at 15:32
  • 2
    It's pretty new, but pretty solidly built because it's focused on narrow tasks. It's only made me happy so far. – alistaire Oct 26 '16 at 15:35
  • good catch buddy, very nice – ℕʘʘḆḽḘ Oct 26 '16 at 15:38

3 Answers3

4

Pipelines are really a magrittr rather than dplyr thing:

df %>% with(table(var1, var2)) %>% prop.table(margin=2)
Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
  • 2
    If you're going full magrittr, you could do `df %$% table(var1, var2) %>% prop.table(margin=2)` (or just `df %>% table() %>% prop.table(margin=2)` for this particular case). I think the issue is that you end up with a table instead of a data.frame, though. – alistaire Oct 26 '16 at 15:37
3

You can use group_by to mimic the desired margin, then divide each count by the sum of counts in the group (note, remove column n to avoid problems when spreading):

df %>%
  count(var1,var2) %>%
  group_by(var2) %>%
  mutate(prop = n / sum(n)) %>%
  select(-n) %>%
  spread(var2, prop, fill = 0)

gives

    var1     A     B     F     G
1      A     0   0.0     0     1
2      B     0   0.5     0     0
3      C     0   0.5     1     0
4      G     1   0.0     0     0
Mark Peterson
  • 9,370
  • 2
  • 25
  • 48
2

You can also just reverse the variables in count:

res1 <- df %>% count(var1,var2) %>% 
               mutate(prop = prop.table(n)) %>% spread(var2, prop, fill = 0)
##Source: local data frame [4 x 6]
##Groups: var1 [4]
##
##    var1     n     A     B     F     G
##* <fctr> <int> <dbl> <dbl> <dbl> <dbl>
##1      A     1     0   0.0   0.0     1
##2      B     1     0   1.0   0.0     0
##3      C     1     0   0.5   0.5     0
##4      G     1     1   0.0   0.0     0
res2 <- df %>% count(var2,var1) %>% 
               mutate(prop = prop.table(n)) %>% spread(var2, prop, fill = 0)
### A tibble: 4 x 6
##    var1     n     A     B     F     G
##* <fctr> <int> <dbl> <dbl> <dbl> <dbl>
##1      A     1     0   0.0     0     1
##2      B     1     0   0.5     0     0
##3      C     1     0   0.5     1     0
##4      G     1     1   0.0     0     0

Explanation: as explained by @alistaire in his comments below:

When passed a vector, prop.table just divides each element by the sum, and is thus equivalent to n / sum(n)

Here, count produces the count of the pairs and the result is a group_df grouped by the first variable input to count. Therefore, the subsequent call to prop.table will compute the proportions by each value of that variable. To see this:

df %>% count(var1,var2) %>% mutate(prop = prop.table(n))
Source: local data frame [5 x 4]
Groups: var1 [4]  ## Note: grouped by var1

    var1   var2     n  prop
  <fctr> <fctr> <int> <dbl>
1      A      G     1   1.0
2      B      B     1   1.0
3      C      B     1   0.5
4      C      F     1   0.5
5      G      A     1   1.0

df %>% count(var2,var1) %>% mutate(prop = prop.table(n))
Source: local data frame [5 x 4]
Groups: var2 [4]  ## Note: grouped by var2

    var2   var1     n  prop
  <fctr> <fctr> <int> <dbl>
1      A      G     1   1.0
2      B      B     1   0.5
3      B      C     1   0.5
4      F      C     1   1.0
5      G      A     1   1.0
aichao
  • 7,375
  • 3
  • 16
  • 18
  • 1
    Thanks, I wish I can tell you exactly what `prop.table` does here, but if you look at `str(df %>% count(var1,var2))` vs `str(df %>% count(var2,var1))` you will note that it is a `grouped_df` and that the grouping is by the attribute `vars`, which is `var1` in the first case and `var2` in the second. The subsequent call to `prop.table` must use this to determine the margin table, which is similar to Mark Peterson's solution. – aichao Oct 26 '16 at 15:49
  • 2
    When passed a vector, `prop.table` just divides each element by the sum, and is thus equivalent to `n / sum(n)`. – alistaire Oct 26 '16 at 16:02
  • 1
    @alistaire: Thanks, good to know. So, because the output of `count(var2,var1)` is actually grouped by the first input to `count`, we get the margin by that variable? – aichao Oct 26 '16 at 16:06