I frequently use the group_by() and summarize() functions from the dplyr package in R (note: when the summary statistic is sum(), this is the same as calling count() with its wt argument).
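
For reference, here is a minimal sketch of how count() relates to group_by() plus summarize(). The data frame toy_df and its values are hypothetical and only serve the illustration; the sketch assumes dplyr is installed. Without wt, count() counts rows; with wt, it sums the given column within each group.

```r
library(dplyr)

# Hypothetical toy data, not the data from the question
toy_df <- data.frame(
  group = c("A", "A", "B", "B", "C", "C"),
  var1  = 1:6
)

# Grouped sum written out explicitly
via_summarize <- toy_df %>%
  group_by(group) %>%
  summarize(n = sum(var1))

# Same grouped sum via count(): wt names the column to sum
via_count <- toy_df %>%
  count(group, wt = var1)

# Both approaches should give the same per-group totals
all.equal(via_count$n, via_summarize$n)
```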

Here's an example with some sample data:

library(dplyr)

data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = F),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = F),
  var1 = sample(1:16)
)

Here's the summary and its output:

out_df <- 
    data %>% 
        group_by(group, factor) %>%   # group by both columns so factor is kept in the result
        summarize(sum_var1 = sum(var1))

print(out_df)

Source: local data frame [7 x 3]
Groups: group [4]

    group   factor sum_var1
   <fctr>   <fctr>    <int>
1 Group A Factor 1       29
2 Group B Factor 1        8
3 Group C Factor 1       33
4 Group D Factor 1       12
5 Group A Factor 2       27
6 Group B Factor 2       10
7 Group C Factor 2       17

I often want to express each sum_var1 value as a proportion, not of the overall sum, but of the sum within each level of another variable, such as the factor variable here.

I usually do this by finding the sum for each level of factor and then manually dividing each observation by it, as follows:

out_df %>% group_by(factor) %>% summarize(factor_sum = sum(sum_var1))
to_divide <- (c(rep(82, 4), rep(54, 4)))
out_df$factor_prop_sum_var1 <- out_df$sum_var1 / to_divide

This leads to the desired output, and I can check that factor_prop_sum_var1 sums to 1 within each level of factor:

out_df

Source: local data frame [8 x 4]
Groups: group [4]

    group   factor sum_var1 factor_prop_sum_var1
   <fctr>   <fctr>    <int>                <dbl>
1 Group A Factor 1       26            0.3170732
2 Group B Factor 1       17            0.2073171
3 Group C Factor 1       19            0.2317073
4 Group D Factor 1       18            0.2195122
5 Group A Factor 2        8            0.1481481
6 Group B Factor 2       19            0.3518519
7 Group C Factor 2        7            0.1296296
8 Group D Factor 2       22            0.4074074

out_df %>% group_by(factor) %>% summarize(checking = sum(factor_prop_sum_var1))

# A tibble: 2 × 2
    factor checking
    <fctr>    <dbl>
1 Factor 1        1
2 Factor 2        1

This works, but it's clunky at best. Is there a way to do this more elegantly (preferably within the dplyr "pipeline")?

Joshua Rosenberg

1 Answer

To get proportions within groups, group only by the columns within which you want the proportions to add up to 100%. In this case, after getting the sum for each combination of group and factor, use group_by again, but this time group only by factor, and then use mutate to divide each sum by the total for its factor level.

library(dplyr)

set.seed(100)
data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = F),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = F),
  var1 = sample(1:16)
)

data %>% 
  group_by(group, factor) %>% 
  summarize(sum_var1 = sum(var1)) %>%
  group_by(factor) %>%
  mutate(percent = sum_var1/sum(sum_var1)) %>%
  arrange(factor)

    group   factor sum_var1    percent
1 Group A Factor 1       13 0.25000000
2 Group B Factor 1        8 0.15384615
3 Group C Factor 1       21 0.40384615
4 Group D Factor 1       10 0.19230769
5 Group A Factor 2       20 0.23809524
6 Group B Factor 2       27 0.32142857
7 Group C Factor 2        2 0.02380952
8 Group D Factor 2       35 0.41666667
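
As a follow-up sketch, the same pipeline can be stored and then checked so that the proportions sum to 1 within each level of factor, mirroring the check in the question. The names props and check are introduced here for illustration. The .groups = "drop" argument assumes dplyr 1.0.0 or later; on older versions it can simply be omitted, since the next group_by() replaces the grouping anyway.

```r
library(dplyr)

set.seed(100)
data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4)),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8)),
  var1 = sample(1:16)
)

props <- data %>%
  group_by(group, factor) %>%
  # .groups = "drop" (assumes dplyr >= 1.0.0) returns an ungrouped result
  summarize(sum_var1 = sum(var1), .groups = "drop") %>%
  group_by(factor) %>%
  # sum(sum_var1) is evaluated per factor level because of the grouping
  mutate(percent = sum_var1 / sum(sum_var1)) %>%
  ungroup() %>%
  arrange(factor)

# Sanity check: proportions should add up to 1 within each factor level
check <- props %>%
  group_by(factor) %>%
  summarize(checking = sum(percent))

print(check)
```

Calling ungroup() at the end is a design choice: it avoids surprising behavior if props is later passed to other grouped-aware verbs.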
eipi10