How to calculate share per category within a column?

Question

df = data.frame(week = as.factor(rep(c(1, 2), times = 5)),
                name = as.factor(rep(LETTERS[1:5], times = 2)),
                count = rpois(n = 10, lambda = 20))

    > df
     week   name count
1       1      A    16
2       2      B    14
3       1      C    23
4       2      D    15
5       1      E    12
6       2      A    15
7       1      B    23
8       2      C    22
9       1      D    22
10      2      E    26

I'd like to calculate each name's count share per week. At first I was going to use the following method:

transform(df, week1_share = ifelse(week == "1", round((df$count / sum(df$count)  * 100),2), NA))
transform(df, week2_share = ifelse(week == "2", round((df$count / sum(df$count)  * 100),2), NA))

but creating a separate column for each week and then merging them, just to use the result as a label on the bar plot, seemed too inefficient. There must be a quicker solution that I don't know of yet.

Basically, I would like to produce the plot below, but with the share (%) calculated as above added as a label inside each box.

ggplot(df, aes(reorder(week, -count),count, color = "white", group = name, fill = name))+
        geom_bar(position = "stack", stat = "identity") +
        scale_y_continuous(labels = scales::comma) +
        ggthemes::scale_color_tableau()

I don't know why the reorder function often fails for me. If you have any tips for sorting in descending order, please share.

you mean `aggregate(count ~ name, df, function(i)round(i*100/sum(i), 2))`? or `df$new <- with(df, ave(count, name, FUN = function(i)(round(i*100/sum(i), 2))))` — Sotos, Nov 10 '16 at 08:42
For the count share per week you can use dplyr to group by weeks and mutate to add the column. `library(dplyr)` and `df<- mutate(group_by(df,week), round(count/sum(count) * 100, 2))` — Jim Raynor, Nov 10 '16 at 08:53
Hi, good question, can you fix your typo: `data_frame` instead of `data.frame`, for people copy-pasting the data. — snaut, Nov 10 '16 at 09:02

Prradep · Accepted Answer · 2016-11-10T09:26:57.933

3

I used the data you provided:

# Loading the required data
df = data.frame(week = as.factor(rep(c(1, 2), times = 5)),
                name = as.factor(rep(LETTERS[1:5], times = 2)),
                count = rpois(n = 10, lambda = 20))

Using functions from the plyr package, the percentages and the relative label positions are calculated within each week.

# Loading the required packages
library(plyr)
library(ggplot2)

# Calculating the percentages
df = ddply(df, .(week), transform, percent = round(count/sum(count) * 100))

# Calculating the position for plotting
df = ddply(df, .(week), transform, pos = cumsum(percent) - (0.5 * percent))
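
To make the position step concrete, here is a small worked sketch in base R. The three percentages are hypothetical values standing in for one week's segments, listed in stacking order; they are not taken from the data above.

```r
# Hypothetical percentages for one week, in stacking order
percent <- c(20, 30, 50)

# Upper edge of each segment once the bars are stacked
top <- cumsum(percent)

# Step back half a segment from its upper edge to reach its middle
pos <- top - 0.5 * percent
pos
```

Each label therefore lands halfway between the lower and upper edge of its own segment, provided the bars are stacked in the same order as the rows.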

The plot is then built from the values calculated above.

# Basic graph
p10 <- ggplot() + geom_bar(aes(y = percent, x = week, fill = name), 
                       data = df, stat="identity")

# Adding data labels
p10 <- p10 + geom_text(data=df, aes(x = week, y = pos, 
                                label = paste0(percent,"%")), size=4)
p10
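
As a possible alternative to computing `pos` by hand, the sketch below lets ggplot2 centre the labels itself. It assumes a reasonably recent ggplot2 that provides `geom_col()` and the `vjust` argument of `position_stack()`; the small data frame is hypothetical and fixed so the result is reproducible, and `ave()` is swapped in for `ddply()` to keep the example free of plyr.

```r
library(ggplot2)

# Hypothetical fixed data (not the random data from the question)
df <- data.frame(week = factor(rep(c(1, 2), each = 3)),
                 name = factor(rep(c("A", "B", "C"), times = 2)),
                 count = c(10, 30, 60, 25, 25, 50))

# Share of each name within its week, in percent
df$percent <- round(ave(df$count, df$week, FUN = function(x) x / sum(x) * 100))

# position_stack(vjust = 0.5) places each label in the middle of its segment,
# using the same stacking order as the bars
p <- ggplot(df, aes(x = week, y = percent, fill = name)) +
  geom_col() +
  geom_text(aes(label = paste0(percent, "%")),
            position = position_stack(vjust = 0.5), size = 4)
p
```

Because the labels and the bars go through the same stacking logic, they should stay aligned even if the stacking order differs between ggplot2 versions.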

Is this what you were looking for?

edited Nov 10 '16 at 09:26

answered Nov 10 '16 at 09:14


This was exactly what I was looking for. Thank you so much! I learnt something new! – tmhs Nov 11 '16 at 00:48
I have an extra question, though. Is "# Calculating the position for plotting" the step that places the label in the middle of each box? How does this work? Can you give me some reference to read? – tmhs Nov 11 '16 at 01:01
It is used to calculate the cumulative sum within each group, here `week`. For the usage of cumsum, see [cumsum](http://stackoverflow.com/a/16850230/4836511); for cumsum in a ggplot plotting context, see [1](http://stackoverflow.com/a/15844938/4836511), [2](http://stackoverflow.com/a/15768612/4836511). – Prradep Nov 11 '16 at 05:23
For some reason, in my case the very same code worked except for one change: I used y = 100-pos instead of y = pos. Otherwise my percent labels were placed from bottom to top, while the bars on the chart were stacked from top to bottom. – Marcin Feb 03 '20 at 11:12

snaut · Answer 2 · 2016-11-10T09:13:42.937

A solution in base R, using split, unsplit and prop.table would be:

df2 <- unsplit(lapply(split(df, df$week), 
                  function(x){
                    x$prop <- prop.table(x$count)
                    x}
                  ), df$week)

In short, split returns a list of data.frames split according to the second argument, and unsplit puts a list produced by split back together.
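
The round trip can be sketched on a tiny, hypothetical data frame (the values are made up so the shares are easy to check by hand):

```r
# Hypothetical fixed data: two names in each of two weeks
df <- data.frame(week = factor(c(1, 2, 1, 2)),
                 name = c("A", "A", "B", "B"),
                 count = c(10, 30, 30, 10))

# One data.frame per level of week
pieces <- split(df, df$week)
names(pieces)

# Add the within-week share to each piece
pieces <- lapply(pieces, function(x) {
  x$prop <- prop.table(x$count)   # count / sum(count) within the piece
  x
})

# Reassemble; rows go back to their original positions
df2 <- unsplit(pieces, df$week)
df2
```

The rows of week 1 (first and third) and of week 2 (second and fourth) each get shares that add up to 1, and `df2` keeps the row order of `df`.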

Using the data.table package this is even shorter:

library(data.table)
dt <- data.table(df)
dt[, prop := prop.table(count), by=week]

I'm not really fluent in dplyr, but I'm sure there's also a very short and straightforward solution.

Edit: this is what I came up with in dplyr/magrittr:

library(dplyr)
df3 <- df %>%
   group_by(week) %>%
   mutate(freq = prop.table(count))
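
Whichever variant is used, a quick sanity check is that the shares add up to 1 within each week. A minimal sketch in base R, assuming the counts shown in the question's printout and using `ave()` in place of the package-based grouping:

```r
# Counts copied from the printout in the question
df <- data.frame(week = factor(rep(c(1, 2), times = 5)),
                 name = factor(rep(LETTERS[1:5], times = 2)),
                 count = c(16, 14, 23, 15, 12, 15, 23, 22, 22, 26))

# Within-week share, computed group-wise without extra packages
df$prop <- ave(df$count, df$week, FUN = prop.table)

# Each week's shares should sum to 1
totals <- tapply(df$prop, df$week, sum)
totals

# Percent labels for plotting, one decimal place
df$label <- sprintf("%.1f%%", 100 * df$prop)
```

The `label` column is a hypothetical name; it can be mapped to the `label` aesthetic of `geom_text()` if percentages with a decimal place are preferred.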

I also prefer to use data.table to dplyr. Thank you for sharing your knowledge! — tmhs, Nov 11 '16 at 00:58
