
I have a data.frame of categorical variables that I have divided into groups and I got the counts for each group.

My original data nyD looks like:

Source: local data frame [7 x 3]
Groups: v1, v2, v3

  v1    v2   v3
1  a  plus  yes
2  a  plus  yes
3  a minus   no
4  b minus  yes
5  b     x  yes
6  c     x notk
7  c     x notk

I performed the following operations using dplyr:

ny1 <- nyD %>%
  group_by(v1, v2, v3) %>%
  summarise(count = n()) %>%
  mutate(prop = count / sum(count))


My data "ny1" looks like:

Source: local data frame [5 x 5]
Groups: v1, v2

  v1    v2   v3 count prop
1  a minus   no     1    1
2  a  plus  yes     2    1
3  b minus  yes     1    1
4  b     x  yes     1    1
5  c     x notk     2    1

I want the prop variable to hold the relative frequency within each v1 group: each count divided by the sum of counts for its v1 group. The v1 groups contain 3 "a", 2 "b" and 2 "c" rows, so ny1$prop[1] should be 1/3, ny1$prop[2] should be 2/3, and so on. The mutate step using count/sum(count) is not correct: I need the sum to be taken only over the v1 group. Is there a way to achieve this with dplyr?

andreSmol

1 Answer


You can do this whole thing in one step, starting from your original data nyD and without creating ny1. When you run mutate after summarise, dplyr has by default already dropped the last grouping level (v2), which is certainly my favorite feature in dplyr, so the result is grouped by v1 only and sum(count) is computed within each v1 group:

nyD %>% 
   group_by(v1, v2) %>%
   summarise(count = n()) %>%
   mutate(prop = count/sum(count))

# Source: local data frame [5 x 4]
# Groups: v1
# 
#   v1    v2 count      prop
# 1  a minus     1 0.3333333
# 2  a  plus     2 0.6666667
# 3  b minus     1 0.5000000
# 4  b     x     1 0.5000000
# 5  c     x     2 1.0000000
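
The grouping that remains after summarise can be checked directly. The following is only a sketch: it rebuilds nyD by hand from the table in the question, and it assumes dplyr 1.0.0 or later, where summarise accepts a `.groups` argument that spells out the default "drop the last level" behaviour. The names `counts` and `res` are made up for illustration.

```r
library(dplyr)

# nyD rebuilt by hand from the table in the question (assumption)
nyD <- data.frame(
  v1 = c("a", "a", "a", "b", "b", "c", "c"),
  v2 = c("plus", "plus", "minus", "minus", "x", "x", "x"),
  v3 = c("yes", "yes", "no", "yes", "yes", "notk", "notk"),
  stringsAsFactors = FALSE
)

# Count rows per (v1, v2); "drop_last" removes v2 from the grouping
counts <- nyD %>%
  group_by(v1, v2) %>%
  summarise(count = n(), .groups = "drop_last")

# Only v1 is left as a grouping variable at this point
group_vars(counts)

# So sum(count) is taken within each v1 group
res <- counts %>%
  mutate(prop = count / sum(count))
```

Writing `.groups = "drop_last"` does not change the result here; it only makes explicit the step that the pipeline above relies on.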

Or a shorter version using count (Thanks to @beginneR)

nyD %>% 
  count(v1, v2) %>% 
  mutate(prop = n/sum(n))
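
If v3 should stay in the result, as in the three-column ny1 from the question, one option is to count over all three columns and then group by v1 explicitly before computing the proportion, rather than relying on which level gets dropped. A minimal sketch, again assuming nyD as shown in the question (rebuilt by hand here); the name `res3` is made up for illustration:

```r
library(dplyr)

# nyD rebuilt by hand from the table in the question (assumption)
nyD <- data.frame(
  v1 = c("a", "a", "a", "b", "b", "c", "c"),
  v2 = c("plus", "plus", "minus", "minus", "x", "x", "x"),
  v3 = c("yes", "yes", "no", "yes", "yes", "notk", "notk"),
  stringsAsFactors = FALSE
)

res3 <- nyD %>%
  count(v1, v2, v3) %>%           # one row per combination, counts in column n
  group_by(v1) %>%                # proportions are relative to v1 only
  mutate(prop = n / sum(n)) %>%
  ungroup()                       # return a plain, ungrouped result
```

With this data the prop column comes out as 1/3, 2/3, 1/2, 1/2 and 1, matching the values asked for in the question. Changing the `group_by(v1)` line changes the denominator, so the same pattern covers other "within group" proportions.
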
David Arenburg
  • Or a bit shorter: `count(df, v1, v2) %>% mutate(prop = n/sum(n))` – talat Jan 01 '15 at 16:22
  • Note that the order of the variables in the `group_by()` expression is also important and will determine how the relative proportions are computed. – Keith Hughitt Dec 19 '16 at 14:39
  • Note that the original question had 3 groups; with 2 groups the answer fails to give relative frequencies. However, the count version works with more than 2 groups. – bshor Oct 30 '20 at 20:05
  • @bshor What do you mean? Please provide an example – David Arenburg Oct 31 '20 at 17:27
  • This is relatively helpful, but I think it needs a more general solution for computing proportions within groups, e.g. for a questionnaire response you would want the percentage of men who answered a certain question vs. the percentage of women who answered the same way. – Gmichael Jul 01 '22 at 07:22