
I'm looking for a solution for the following problem: I have data that contains two factor variables EDU and LEVEL. The reproducible data sample is here:

df <- structure(list(EDU = structure(c(3L, 1L, 2L, 2L, 3L, 2L, 3L, 
2L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 2L, 3L, 3L, 1L, 2L, 3L, 2L, 
2L, 2L, 1L, 1L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 
3L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 1L, 3L, 1L), .Label = c("A", 
"B", "C"), class = "factor"), LEVEL = structure(c(3L, 3L, 4L, 
2L, 4L, 3L, 1L, 2L, 2L, 1L, 3L, 2L, 3L, 2L, 3L, 3L, 4L, 2L, 2L, 
4L, 1L, 2L, 3L, 3L, 1L, 4L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 2L, 1L, 
4L, 3L, 1L, 4L, 3L, 4L, 1L, 4L, 2L, 4L, 1L, 1L, 4L, 3L, 1L), .Label = c("1", 
"2", "3", "4"), class = "factor")), class = "data.frame", row.names = c(NA, 
-50L))

Using this data I want to draw a bar plot with ggplot2 that shows the grouping variable EDU on the x-axis and the stacked (cumulative) percentages of LEVEL on the y-axis. Additionally, I want to add a fourth bar that contains the percentages of LEVEL not grouped by EDU, something like an "overall bar". Furthermore, I want to add percentage labels inside the plot, so that every LEVEL is labelled with its relative frequency, like in this thread or this thread. To be honest, I tried to adapt my code using different solutions from Stack Overflow, since there are many threads on this topic (especially on percentage labels), but I got stuck. So far, my ggplot2 code looks like this:

library(tidyverse)

ggplot(df, aes(x=EDU, fill=LEVEL)) +
  geom_bar(position="fill") +
  scale_y_continuous(labels = scales::percent)

And results in the following plot:

Example plot

That plot looks good so far. But as mentioned above, my aim is to add percentage labels, probably with geom_text, AND a fourth "overall bar" beside the three existing ones. For the percentage labels I also tried building a table of row proportions with prop.table and adding each label with annotate:

props <- prop.table(table(df$EDU, df$LEVEL), margin=1)

ggplot(df, aes(x=EDU, fill=LEVEL)) +
  geom_bar(position="fill") +
  scale_y_continuous(labels = scales::percent) +
  annotate("text", x="A", y=.15, label=scales::percent(props[1,4])) +
  annotate("text", x="B", y=.10, label=scales::percent(props[2,4])) +
  annotate("text", x="C", y=.275, label=scales::percent(props[3,4])) +
  
  annotate("text", x="A", y=.375, label=scales::percent(props[1,3])) +
  annotate("text", x="B", y=.275, label=scales::percent(props[2,3])) +
  annotate("text", x="C", y=.625, label=scales::percent(props[3,3])) +
  
  annotate("text", x="A", y=.66, label=scales::percent(props[1,2])) +
  annotate("text", x="B", y=.5, label=scales::percent(props[2,2])) +
  annotate("text", x="C", y=.78, label=scales::percent(props[3,2])) +
  
  annotate("text", x="A", y=.9, label=scales::percent(props[1,1])) +
  annotate("text", x="B", y=.9, label=scales::percent(props[2,1])) +
  annotate("text", x="C", y=.9, label=scales::percent(props[3,1])) 

That results in the following plot: Example 2

This seems cumbersome to me, especially when I want to create more than one plot and have to annotate each percentage separately. So the question is how I can set the y arguments in annotate in an automated way, so that R positions the labels for me.

Regarding the "overall bar" problem I have no idea how to solve this, unfortunately.

I'm grateful for any help!

fbeese
  • Hi, why don't you add the few lines that you used to add the percentages? That's probably the way to go. It also helps us avoid showing you exactly what you already know. Please clarify what you mean by "overall column". – tjebo Dec 27 '21 at 14:50
  • Hi, the solution with the `prop.table` was to use the following code: `props <- prop.table(table(df$EDU, df$LEVEL), margin=1)` and then use `annotate` within the `ggplot` call to include percentage labels in the bars. But as seen in other threads, there are other ways that don't require labelling each bar separately with `annotate`. Unfortunately, I failed to adapt my code along the lines of those threads. – fbeese Dec 27 '21 at 15:08
  • The "overall bar" problem is the following: Right now, there are three bars in the plot, which show the relative frequencies of `LEVEL` stratified by `EDU`. I want to include a fourth bar which shows the relative frequencies of `LEVEL` without the `EDU` stratification, i.e. the distribution of `LEVEL` across all observations. Hope that makes it more understandable. – fbeese Dec 27 '21 at 15:12

1 Answer


Rest assured: the more experienced you get, the less you will be afraid of preparing the data beforehand. You will see that it is often much easier and cleaner to first shape the data into exactly what you want to plot, and then plot it. Don't try to do everything within ggplot2; that can get quite painful.

Comments in the code

library(tidyverse)

##  create a percentage column manually
df_perc <- 
  df %>% 
  count(EDU, LEVEL) %>%
  group_by(EDU) %>%
  mutate(perc = n*100/sum(n)) 

## for the total, create a new data frame and bind to the old one
total <- 
  df_perc %>%
  group_by(LEVEL) %>%
  summarise(n = sum(n)) %>%
  ## ungroup for the total
  ungroup() %>%
  ## add EDU column called total, so you can bind it and plot it easily 
  mutate(perc= n*100/sum(n), EDU = "Total")

## now bind them and plot them
bind_rows(df_perc, total) %>%
ggplot(aes(x=EDU, y = perc, fill=LEVEL)) +
  ## use geom_col, and remove position = fill
  geom_col() +
  # now you can add the labels easily as per all those threads
  geom_text(aes(label = paste(round(perc, 2), "%")), position = position_stack(vjust = .5)) +
  ## you can either change the y values, or use a different scale factor
  scale_y_continuous("Percent", labels = function(x) scales::percent(x, scale = 1))
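
A possible variation on the same idea, as a minimal sketch rather than a definitive version: keep the shares as proportions in [0, 1], let `scales::percent` format both the labels and the axis, and set the bar order explicitly so "Total" comes last even if a group name would sort after it. The data frame `toy` and the names `by_edu`, `overall`, `plot_data` and `prop` are hypothetical stand-ins, not part of the question's data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the question's `df` (same column names, fewer rows)
toy <- data.frame(
  EDU   = factor(c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C")),
  LEVEL = factor(c("1", "2", "2", "1", "3", "1", "1", "2", "3", "3"))
)

# Shares of LEVEL within each EDU group, kept as proportions in [0, 1]
by_edu <- toy %>%
  count(EDU, LEVEL) %>%
  group_by(EDU) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  mutate(EDU = as.character(EDU))

# Shares of LEVEL over all rows, labelled "Total" so they plot as an extra bar
overall <- toy %>%
  count(LEVEL) %>%
  mutate(prop = n / sum(n), EDU = "Total")

# Bind both and fix the bar order so "Total" is always the last bar
plot_data <- bind_rows(by_edu, overall) %>%
  mutate(EDU = factor(EDU, levels = c(levels(toy$EDU), "Total")))

# Sanity check: the shares within every bar add up to 1
check <- plot_data %>%
  group_by(EDU) %>%
  summarise(total = sum(prop))
stopifnot(isTRUE(all.equal(check$total, rep(1, nrow(check)))))

p <- ggplot(plot_data, aes(x = EDU, y = prop, fill = LEVEL)) +
  geom_col() +
  # position_stack(vjust = 0.5) centres each label inside its segment
  geom_text(
    aes(label = scales::percent(prop, accuracy = 0.1)),
    position = position_stack(vjust = 0.5)
  ) +
  scale_y_continuous("Percent", labels = scales::percent)
p
```

With the question's data, replacing `toy` by `df` should give the same four-bar layout; the `stopifnot` line is only there to catch a grouping mistake before plotting.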

tjebo