I'm having a hard time making rounded percentages that add up to 100% within groups.
Consider the following example:
# Loading main library used
library(dplyr)
# Creating the basic data frame
df = data.frame(group = c('A','A','A','A','B','B','B','B'),
categories = c('Cat1','Cat2','Cat3','Cat4','Cat1','Cat2','Cat3','Cat4'),
values = c(2200,4700,3000,2000,2900,4400,2200,1000))
print(df)
# group categories values
# 1 A Cat1 2200
# 2 A Cat2 4700
# 3 A Cat3 3000
# 4 A Cat4 2000
# 5 B Cat1 2900
# 6 B Cat2 4400
# 7 B Cat3 2200
# 8 B Cat4 1000
df_with_shares = df %>%
# Calculating group totals and adding them back to the main df
left_join(df %>%
group_by(group) %>%
summarize(group_total = sum(values)),
by='group') %>%
# Calculating each category's share within the groups
mutate(group_share = values / group_total,
group_share_rounded = round(group_share,2))
# Summing the rounded shares within groups
rounded_totals = df_with_shares %>%
group_by(group) %>%
summarize(total_share = sum(group_share_rounded))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <int>
# 1 A 0.99
# 2 B 1.01
# Note how the totals do not add up to 100% as expected
I am aware of a few generic solutions to the "rounding percentages to add up to 100%" problem, as explained in this SO post. I was even able to make a little R implementation of one of those approaches, as seen here. This is what it would look like if I just applied that R approach to this problem:
df_with_rounded_shares = df %>%
mutate(
percs = values / sum(values),
percs_cumsum = cumsum(percs),
percs_cumsum_round = round(percs_cumsum, 2),
percs_cumsum_round_offset = replace_na(lag(percs_cumsum_round,1),0),
percs_rounded_final = percs_cumsum_round - percs_cumsum_round_offset)
However, the method I devised in the thread above does not work as I would like. It just calculates the shares of the values
column across the whole dataset. In other words, it does not take into consideration the grouping variable representing the multiple groups in the data, each of which need their rounded values to add up to 100% independently from every other group.
What can I do to generate a column of rounded percentages that add up to 100% by group?
PS: While writing this question I actually found something that worked, so I'll answer my own question below. I know it's super simple, but I think it's still worth having a direct answer here on SO addressing this issue.