1

I'm having a hard time making rounded percentages that add up to 100% within groups.

Consider the following example:

# Loading main library used
library(dplyr)

# Creating the basic data frame
df = data.frame(group = c('A','A','A','A','B','B','B','B'),
                categories = c('Cat1','Cat2','Cat3','Cat4','Cat1','Cat2','Cat3','Cat4'),
                values = c(2200,4700,3000,2000,2900,4400,2200,1000))

print(df)
#   group categories values
# 1     A       Cat1   2200
# 2     A       Cat2   4700
# 3     A       Cat3   3000
# 4     A       Cat4   2000
# 5     B       Cat1   2900
# 6     B       Cat2   4400
# 7     B       Cat3   2200
# 8     B       Cat4   1000

df_with_shares = df %>%
  # Calculating group totals and adding them back to the main df
  left_join(df %>% 
              group_by(group) %>% 
              summarize(group_total = sum(values)),
            by='group') %>%
  # Calculating each category's share within the groups
  mutate(group_share = values / group_total,
         group_share_rounded = round(group_share,2))


# Summing the rounded shares within groups
rounded_totals = df_with_shares %>% 
  group_by(group) %>% 
  summarize(total_share = sum(group_share_rounded))


print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr>       <int>
#   1 A        0.99
#   2 B        1.01
# Note how the totals do not add up to 100% as expected

I am aware of a few generic solutions to the "rounding percentages to add up to 100%" problem, as explained in this SO post. I was even able to make a little R implementation of one of those approaches, as seen here. This is what it would look like if I just applied that R approach to this problem:

df_with_rounded_shares = df %>% 
  mutate(
    percs = values / sum(values),
    percs_cumsum = cumsum(percs),
    percs_cumsum_round = round(percs_cumsum, 2),
    percs_cumsum_round_offset = replace_na(lag(percs_cumsum_round,1),0),
    percs_rounded_final = percs_cumsum_round - percs_cumsum_round_offset)

However, the method I devised in the thread above does not work as I would like. It just calculates the shares of the values column across the whole dataset. In other words, it does not take into consideration the grouping variable representing the multiple groups in the data, each of which need their rounded values to add up to 100% independently from every other group.

What can I do to generate a column of rounded percentages that add up to 100% by group?

PS: While writing this question I actually found something that worked, so I'll answer my own question below. I know it's super simple, but I think it's still worth having a direct answer here on SO addressing this issue.

Felipe D.
  • 1,157
  • 9
  • 19

1 Answers1

2

The method devised in your implementation (from here) just needs a few small tweaks to make it work.

First, include a group_by statement before calculating the new columns. Also, you need to use a summarize statement instead of the mutate statement you have now.

In essence, this is what it'll look like:

# Modified version of your implementation of the rounding procedure. 
# The new procedure below accommodates for grouping variables.
df_with_rounded_shares_by_group = df %>% 
  group_by(group) %>% 
  summarize(
    group_share = values / sum(values),
    group_share_cumsum = cumsum(group_share),
    group_share_cumsum_round = round(group_share_cumsum, 2),
    group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round,1),0),
    group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
  # Removing unnecessary temporary columns
  select(-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset)

# Verifying if the results add up to 100% within each group
rounded_totals = df_with_rounded_shares_by_group %>% 
  group_by(group) %>% 
  summarize(total_share = sum(group_share_rounded_final))

print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr>       <dbl>
#   1 A           1
#   2 B           1

# Yep, they all add up to 100% as expected!

Btw, apologies for the ridiculously long column names. I just made them enormous to make it clear what each step was really doing.

Felipe D.
  • 1,157
  • 9
  • 19