I frequently use the group_by() and summarize() functions from the dplyr package in R (note: when the summary statistic is sum(), this is the same as count() with a wt argument).
Here's an example of how I use them:
library(dplyr)
data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = FALSE),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = FALSE),
  var1 = sample(1:16)
)
Here's the summary and its output:
out_df <-
  data %>%
  group_by(group, factor) %>%
  summarize(sum_var1 = sum(var1))
print(out_df)
Source: local data frame [7 x 3]
Groups: group [4]
    group   factor sum_var1
   <fctr>   <fctr>    <int>
1 Group A Factor 1       29
2 Group B Factor 1        8
3 Group C Factor 1       33
4 Group D Factor 1       12
5 Group A Factor 2       27
6 Group B Factor 2       10
7 Group C Factor 2       17
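As an aside on the count() note above, here is a minimal sketch of that equivalence. The `toy` data frame is hypothetical (it is not the data from this post), and the `name` argument of count() assumes a reasonably recent dplyr version.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate the equivalence
toy <- data.frame(
  group = c("Group A", "Group A", "Group B", "Group B"),
  var1  = c(1, 2, 3, 4)
)

# Grouped sum with group_by() + summarize()
by_summarize <- toy %>%
  group_by(group) %>%
  summarize(sum_var1 = sum(var1))

# The same sums via count(): wt = var1 sums var1 within each group,
# and name = sets the name of the output column
by_count <- toy %>%
  count(group, wt = var1, name = "sum_var1")
```

Both objects should hold one row per group with the same sum_var1 values; without wt, count() would return the number of rows per group instead.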
Now, I often want to find what proportion each sum_var1 value represents, not of the overall sum, but of the sum for one level of a factor, such as the factor variable here.
I usually do this by finding the sum for each level of the factor, and then manually dividing the observations by it, as follows:
out_df %>% group_by(factor) %>% summarize(factor_sum = sum(sum_var1))
to_divide <- (c(rep(82, 4), rep(54, 4)))
out_df$factor_prop_sum_var1 <- out_df$sum_var1 / to_divide
This leads to the desired output, and I can check that the sum of factor_prop_sum_var1 equals 1 within each level of factor:
out_df
Source: local data frame [8 x 4]
Groups: group [4]
    group   factor sum_var1 factor_prop_sum_var1
   <fctr>   <fctr>    <int>                <dbl>
1 Group A Factor 1       26            0.3170732
2 Group B Factor 1       17            0.2073171
3 Group C Factor 1       19            0.2317073
4 Group D Factor 1       18            0.2195122
5 Group A Factor 2        8            0.1481481
6 Group B Factor 2       19            0.3518519
7 Group C Factor 2        7            0.1296296
8 Group D Factor 2       22            0.4074074
out_df %>% group_by(factor) %>% summarize(checking = sum(factor_prop_sum_var1))
# A tibble: 2 × 2
    factor checking
    <fctr>    <dbl>
1 Factor 1        1
2 Factor 2        1
This works, but it's clunky at best. Is there a way to do this more, uh, elegantly (preferably within the dplyr "pipeline")?
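One possible direction, offered as a sketch rather than a definitive answer: regroup by factor and divide inside mutate(), so that sum() is evaluated per factor level and no hard-coded totals are needed. The small out_df below is a hypothetical stand-in with made-up numbers, and the name with_props is likewise only illustrative.

```r
library(dplyr)

# Hypothetical stand-in for out_df: one row per group/factor combination
out_df <- data.frame(
  group    = rep(c("Group A", "Group B"), times = 2),
  factor   = rep(c("Factor 1", "Factor 2"), each = 2),
  sum_var1 = c(30, 10, 5, 15)
)

with_props <- out_df %>%
  # group_by() replaces any existing grouping by default
  group_by(factor) %>%
  # inside a grouped mutate(), sum() is computed within each level of factor
  mutate(factor_prop_sum_var1 = sum_var1 / sum(sum_var1)) %>%
  ungroup()

# Check: the proportions should sum to 1 within each level of factor
with_props %>%
  group_by(factor) %>%
  summarize(checking = sum(factor_prop_sum_var1))
```

Because mutate() keeps every row, the same two lines could in principle be appended directly after the summarize() step that builds out_df, which would keep the whole calculation in one pipeline.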