7

I'm looking for a way to label a stacked bar chart with percentages while the y-axis shows the original count (using ggplot). Here is a MWE for the plot without labels:

library(ggplot2)
df <- as.data.frame(matrix(nrow = 7, ncol= 3,
                       data = c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7",
                                "north", "north", "north", "north", "south", "south", "south",
                                "A", "B", "B", "C", "A", "A", "C"),
                      byrow = FALSE))

colnames(df) <- c("ID", "region", "species")

p <- ggplot(df, aes(x = region, fill = species))
p  + geom_bar()

My real table is much larger, and R counts the species per region quite nicely. Now I would like to show both the original count (preferably on the y-axis) and the percentage (as a label), to compare the proportions of species between regions.

I tried out many things using geom_text(), but I think the main differences from other questions (e.g. this one) are that

  • I do not have a separate column for y values (they are just the counts of different species per region) and
  • I need the labels per region to sum up to 100% (since the regions are considered to represent separate populations), not all labels of the entire plot.

Any help is much appreciated!!

eipi10
Johanna
  • 5
    When you're doing something non-standard you usually need to compute the numbers yourself. It *might* be possible to do this inside ggplot, but it won't be straightforward. Better to use functions built for data manipulation than trying to do data manipulation within ggplot. – Gregor Thomas Jun 14 '16 at 16:54

2 Answers

14

As @Gregor mentioned, summarize the data separately and then feed the data summary to ggplot. In the code below, we use dplyr to create the summary on the fly:

library(dplyr)

ggplot(df %>% count(region, species) %>%    # Count the rows for each region/species combination
         group_by(region) %>%               # Group explicitly: newer dplyr versions return count() ungrouped here
         mutate(pct=n/sum(n),               # Calculate percent within each region
                ypos = cumsum(n) - 0.5*n),  # Label positions; assumes the first species sits at the bottom of each stack (ggplot2 before 2.2.0)
       aes(region, n, fill=species)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=paste0(sprintf("%1.1f", pct*100),"%"), y=ypos))


Update: With ggplot2 2.2.0 and later, you no longer need to provide a y-value to center the text within each bar. Instead you can use position_stack(vjust=0.5):

ggplot(df %>% count(region, species) %>%    # Count the rows for each region/species combination
         group_by(region) %>%               # Group explicitly: newer dplyr versions return count() ungrouped here
         mutate(pct=n/sum(n)),              # Calculate percent within each region
       aes(region, n, fill=species)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=paste0(sprintf("%1.1f", pct*100),"%")),
            position=position_stack(vjust=0.5))  # Center each label within its bar segment
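
To check that the labels really add up to 100% within each region, it can help to build the summary as a separate data frame before plotting. The following is only a minimal sketch, assuming the df from the question; summary_df and region_totals are names chosen here for illustration and do not appear in the original answer.

```r
library(dplyr)

# Data from the question, rebuilt as a plain data frame
df <- data.frame(
  ID      = paste0("ID", 1:7),
  region  = c("north", "north", "north", "north", "south", "south", "south"),
  species = c("A", "B", "B", "C", "A", "A", "C")
)

# One row per region/species combination, with the share inside each region
summary_df <- df %>%
  count(region, species) %>%
  group_by(region) %>%            # explicit grouping, independent of what count() returns
  mutate(pct = n / sum(n)) %>%
  ungroup()

print(summary_df)

# Sum of the shares per region; each total should be 1
region_totals <- summary_df %>%
  group_by(region) %>%
  summarise(total = sum(pct))

print(region_totals)
```

If the totals differ from 1, the percentages were computed over the whole table rather than per region, which is the symptom discussed in the comments.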
eipi10
  • 1
    Thanks a lot, this is exactly what I was looking for! – Johanna Jun 15 '16 at 08:22
  • 1
    Note that the code presented above will NOT produce the barplot shown! You have to use a `group_by` command in addition to that: `df %>% group_by(region) %>% count(region, species) %>% mutate(pct=n/sum(n))` – J_F Dec 07 '17 at 09:48
  • 3
    `group_by` is unnecessary. `count(x,y)` is the equivalent of `group_by(x,y) %>% tally`. – eipi10 Dec 07 '17 at 16:36
1

I agree with Johanna. You could try:

d <- aggregate(.~region+species, df, length)
d$percent <- paste(round(d$ID/sum(d$ID)*100),'%',sep='')  # columns must be referenced via d$ (see the comment below)
ggplot(d, aes(region, ID, fill=species)) + geom_bar(stat='identity') + 
  geom_text(position='stack', aes(label=paste(round(ID/sum(ID)*100),'%',sep='')), vjust=5)
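
For labels that sum to 100% within each region, the share can be divided by the regional total instead of the grand total. The following is a minimal base R sketch, assuming the df from the question; the helper column n and the columns pct and label are names introduced here for illustration, and geom_col() with position_stack(vjust = 0.5) assumes ggplot2 2.2.0 or later.

```r
library(ggplot2)

# Data from the question, rebuilt as a plain data frame
df <- data.frame(
  ID      = paste0("ID", 1:7),
  region  = c("north", "north", "north", "north", "south", "south", "south"),
  species = c("A", "B", "B", "C", "A", "A", "C")
)

# Count the rows per region/species combination via a helper column of ones
df$n <- 1L
d <- aggregate(n ~ region + species, data = df, FUN = sum)

# Share within each region: ave() returns the regional total for every row
d$pct   <- d$n / ave(d$n, d$region, FUN = sum)
d$label <- paste0(round(d$pct * 100), "%")

# Counts on the y-axis, per-region percentages as labels centered in each segment
p <- ggplot(d, aes(region, n, fill = species)) +
  geom_col() +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5))
p
```

Because the labels are rounded to whole percentages, they can be off by one point in total for some data; the unrounded pct column always sums to 1 per region.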
teadotjay
  • Thanks for your input, but in your solution the percentages per stack do not sum up to 100%. BTW: I guess it should be `d$percent <- paste(round(d$ID/sum(d$ID)*100),'%',sep='')`. – Johanna Jun 15 '16 at 08:24