My goal is to create a barplot that visualises the percentages of three variables; however, my current graph does so in a rather confusing way. A little bit of context: each of my variables can have one of two possible values:
- Reference: null or overt
- Variety: SING or GB
- Register: S1A or S1B
Overall, the data frame looks like this (with a few more thousand lines):
  Reference Register Variety
1      null      S1A    SING
2      null      S1A    SING
3      null      S1A    SING
4      null      S1A    SING
5      null      S1A    SING
6      null      S1A    SING
I have used the following code to create the barplot below:
library(dplyr)    # count(), mutate(), %>%
library(ggplot2)

data_raw <- read.csv("INPUT.csv", header = TRUE, sep = ",")

# Count each Reference/Variety/Register combination and convert to overall percentages
data_2 <- data_raw %>%
  count(Reference, Variety, Register) %>%
  mutate(pct = n / sum(n),
         pct_label = scales::percent(pct))

ggplot(data_2, aes(x = Reference, fill = Variety, y = pct)) +
  geom_col() +
  geom_text(aes(label = paste(pct_label, n, sep = "\n")),
            lineheight = 0.8,
            position = position_stack(vjust = 0.5)) +
  scale_y_continuous(labels = scales::percent)
The third variable, Register, is represented by two separate values within a single-coloured box, e.g., 684/20.22% (S1B) and 931/27.52% (S1A) for the variety GB. While I can infer from my data which of these two values stands for S1A or S1B, I need this to be apparent from the barplot as well. For example, would it be possible to add a label to "684/20.22%" that indicates that it is the S1B value?
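To make the labelling idea concrete, here is a minimal sketch of one possible approach: prefix each label with its Register value and set the grouping explicitly so that bars and labels stack in the same order. The data frame `toy` and its counts are hypothetical placeholders rather than the real corpus figures, and the names `toy_labelled` and `p_labelled` are invented for this illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical counts standing in for the real data (not the actual figures)
toy <- data.frame(
  Reference = rep(c("null", "overt"), each = 4),
  Variety   = rep(rep(c("GB", "SING"), each = 2), times = 2),
  Register  = rep(c("S1A", "S1B"), times = 4),
  n         = c(40, 25, 60, 35, 900, 700, 800, 650)
)

toy_labelled <- toy %>%
  mutate(pct = n / sum(n),
         pct_label = scales::percent(pct, accuracy = 0.01),
         # Prefix each label with its Register so every segment is identifiable
         label = paste0(Register, ": ", n, " (", pct_label, ")"))

# An explicit group keeps the stacking order identical for bars and labels
p_labelled <- ggplot(toy_labelled,
                     aes(x = Reference, y = pct, fill = Variety,
                         group = interaction(Variety, Register))) +
  geom_col(colour = "white") +   # white outline separates the Register segments
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5), size = 3) +
  scale_y_continuous(labels = scales::percent)

p_labelled
```

The white outline is only a visual aid for telling the two Register segments of one colour apart; whether labels of this length still fit inside the small "null" segments would have to be checked on the real data.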
Another obvious problem is that the bar for the x-value "null" contains very low percentages, which makes its labels hard to read. I'm not sure what would be the best way to handle this. Perhaps it would make sense to do away with the numbers altogether and rely on colours only.
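For the small "null" values, one option that might help is to give each Reference value its own panel with an independent y-axis, and to put Register on the x-axis so it is labelled directly. The sketch below assumes this layout is acceptable; `toy_counts` holds hypothetical counts (not the real data), and `toy_pct` and `p_facets` are names invented for the example.

```r
library(dplyr)
library(ggplot2)

# Hypothetical counts, for illustration only (not the real corpus figures)
toy_counts <- data.frame(
  Reference = rep(c("null", "overt"), each = 4),
  Variety   = rep(rep(c("GB", "SING"), each = 2), times = 2),
  Register  = rep(c("S1A", "S1B"), times = 4),
  n         = c(40, 25, 60, 35, 900, 700, 800, 650)
)

# Share of all observations, computed the same way as in the original code
toy_pct <- toy_counts %>%
  mutate(pct = n / sum(n))

p_facets <- ggplot(toy_pct,
                   aes(x = Register, y = pct, fill = Variety, group = Variety)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_text(aes(label = scales::percent(pct, accuracy = 0.1)),
            position = position_dodge(width = 0.9), vjust = -0.3, size = 3) +
  # One panel per Reference value; free_y gives "null" its own y-axis range
  facet_wrap(~ Reference, scales = "free_y") +
  scale_y_continuous(labels = scales::percent,
                     expand = expansion(mult = c(0, 0.1)))

p_facets
```

The trade-off is that bar heights are no longer comparable across the two panels by eye, since each panel has its own scale; the axis labels have to be read to compare "null" with "overt".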
I'd be very grateful for any suggestions or solutions to my problem. I'm still a beginner and hope to become better at using R for data analysis.