Data labels for mean and percentiles in a distribution chart

Question

I'm creating a custom chart to visualize a variable's distribution using geom_density. I added 3 vertical lines for a custom value, the 5th percentile and the 95th percentile.

How do I add labels for those lines?

I tried using geom_text, but I don't know how to set the x and y aesthetics.

library(ggplot2)

ggplot(dataset, aes(x = dataset$`Estimated percent body fat`)) + 
  geom_density() +
  geom_vline(aes(xintercept = dataset$`Estimated percent body fat`[12]), 
             color = "red", size = 1) +
  geom_vline(aes(xintercept = quantile(dataset$`Estimated percent body fat`,
                                       0.05, na.rm = TRUE)), 
             color = "grey", size = 0.5) +
  geom_vline(aes(xintercept = quantile(dataset$`Estimated percent body fat`,
                                       0.95, na.rm = TRUE)), 
             color="grey", size=0.5) +

  geom_text(aes(x = dataset$`Estimated percent body fat`[12], 
                label = "Custom", y = 0), 
            colour = "red", angle = 0)

I'd like to obtain the following:

For the custom value, I'd like the label at the top of the chart, just to the right of the line.
For the percentile labels, I'd like them in the middle of the chart: to the left of the line for the 5th percentile and to the right of the line for the 95th percentile.

Here is what I was able to obtain https://i.stack.imgur.com/lfRRK.png

And these are the first 50 lines of my dataset:

structure(list(`Respondent sequence number` = c(21029L, 21034L, 
21043L, 21056L, 21067L, 21085L, 21087L, 21105L, 21107L, 21109L, 
21110L, 21125L, 21129L, 21138L, 21141L, 21154L, 21193L, 21195L, 
21206L, 21215L, 21219L, 21221L, 21232L, 21239L, 21242L, 21247L, 
21256L, 21258L, 21287L, 21310L, 21325L, 21367L, 21380L, 21385L, 
21413L, 21418L, 21420L, 21423L, 21427L, 21432L, 21437L, 21441L, 
21444L, 21453L, 21466L, 21467L, 21477L, 21491L, 21494L, 21495L
), `Estimated percent body fat` = c(NA, 7.2, NA, NA, 24.1, 25.1, 
30.2, 23.6, 24.3, 31.4, NA, 14.1, 20.5, NA, 23.1, 30.6, 21, 20.9, 
NA, 24, 26.7, 16.6, NA, 26.9, 16.9, 21.3, 15.9, 27.4, 13.9, NA, 
20, NA, 12.8, NA, 33.8, 18.1, NA, NA, 28.4, 10.9, 38.1, 33, 39.3, 
15.9, 32.7, NA, 20.4, 16.8, NA, 29)), row.names = c(NA, 50L), class = 
"data.frame")

Welcome to Stack Overflow! Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use `str()`, `head()` or screenshot)? You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to assist you with that. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) — Tung, Mar 30 '19 at 04:49
@Luca Please **edit your question** with additional information. Don't put them in a comment. — Z.Lin, Mar 30 '19 at 05:34
Thanks both. I just added an example of the chart with reprex, and dataset example with dput — Luca, Mar 30 '19 at 05:52
@Luca Are you strongly depending on `ggplot`? I find this easier to achieve with base plots. — jay.sf, Mar 30 '19 at 08:08
@jay.sf I'm happy to use base plots instead if it allows me to add the labels — Luca, Mar 30 '19 at 08:30

jay.sf · Accepted Answer · 2019-03-30T08:56:34.777

First, I recommend cleaning up the column names.

dat <- dataset
names(dat) <- tolower(gsub("\\s", "\\.", names(dat)))

With base R plots you could do the following. The trick is that you can store the quantiles and the custom value while drawing the lines and reuse them later as coordinates for the labels, which gives you dynamic positioning. I'm not sure if/how this is possible with ggplot.

plot(density(dat$estimated.percent.body.fat, na.rm=TRUE), ylim=c(0, .05), 
     main="Density curve")
abline(v=c1 <- dat$estimated.percent.body.fat[12], col="red")
abline(v=q1 <- quantile(dat$estimated.percent.body.fat, .05, na.rm=TRUE), col="grey")
abline(v=q2 <- quantile(dat$estimated.percent.body.fat, .95, na.rm=TRUE), col="grey")
text(c1 + 4, .05, c(expression("" %<-% "custom")), cex=.8)
text(q1 - 5.5, .025, c(expression("5% percentile" %->% "")), cex=.8)
text(q2 + 5.5, .025, c(expression("" %<-% "95% percentile")), cex=.8)

Note: If you don't like the arrows, just use e.g. "5% percentile" instead of c(expression("5% percentile" %->% "")).
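As a variation on the base R code above, here is a minimal sketch that avoids the hard-coded offsets (+ 4, - 5.5) and the fixed y values. It assumes the `pos` argument of text() is acceptable for placing a label to the left (2) or right (4) of a coordinate, and it reads the plot limits from par("usr") after plotting. The vector `bf` is just a copy of the body-fat column from the question, and `y_top` / `y_mid` are names made up for this illustration.

```r
# Copy of the `Estimated percent body fat` column from the question's data
bf <- c(NA, 7.2, NA, NA, 24.1, 25.1, 30.2, 23.6, 24.3, 31.4, NA, 14.1,
        20.5, NA, 23.1, 30.6, 21, 20.9, NA, 24, 26.7, 16.6, NA, 26.9,
        16.9, 21.3, 15.9, 27.4, 13.9, NA, 20, NA, 12.8, NA, 33.8, 18.1,
        NA, NA, 28.4, 10.9, 38.1, 33, 39.3, 15.9, 32.7, NA, 20.4, 16.8,
        NA, 29)

d  <- density(bf, na.rm = TRUE)
c1 <- bf[12]                                   # custom value
q  <- quantile(bf, c(.05, .95), na.rm = TRUE)  # 5th and 95th percentiles

plot(d, main = "Density curve")
abline(v = c1, col = "red")
abline(v = q, col = "grey")

# par("usr") returns c(x_min, x_max, y_min, y_max) of the current plot region
usr   <- par("usr")
y_top <- usr[4] - 0.05 * (usr[4] - usr[3])     # just below the top edge
y_mid <- mean(usr[3:4])                        # vertical middle

# pos = 2 puts the label left of the point, pos = 4 puts it to the right
text(c1,   y_top, "custom",         pos = 4, cex = .8)
text(q[1], y_mid, "5% percentile",  pos = 2, cex = .8)
text(q[2], y_mid, "95% percentile", pos = 4, cex = .8)
```

Because the y positions are derived from the plot region rather than typed in, the same lines should carry over to a dataset whose density peaks at, say, 0.1 instead of 0.05.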

Alternatively, in ggplot2 you could use annotate().

library(ggplot2)
ggplot(dataset, aes(x = dataset$`Estimated percent body fat`)) + 
  geom_density() +
  geom_vline(aes(xintercept = dataset$`Estimated percent body fat`[12]), 
             color = "red", size = 1) +
  geom_vline(aes(xintercept = quantile(dataset$`Estimated percent body fat`,
                                       0.05, na.rm = TRUE)), 
             color = "grey", size = 0.5) +
  geom_vline(aes(xintercept = quantile(dataset$`Estimated percent body fat`,
                                       0.95, na.rm = TRUE)), 
             color="grey", size=0.5) +
  annotate("text", x=16, y=.05, label="custom") +
  annotate("text", x=9.5, y=.025, label="5% percentile") +
  annotate("text", x=38, y=.025, label="95% percentile")

Note that in either solution the result (i.e. the exact label positions) depends on your export size. To learn how to control this, have a look at e.g. "How to save a plot as image on the disk?".
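To illustrate the export-size point, here is a small sketch that writes the plot to a file with explicit dimensions, so label placement is judged at a known size. It uses the base pdf() device; the output path and the 8 x 5 inch size are arbitrary choices for this example, not something from the question. png() takes width and height in a similar way (in pixels by default), and ggsave() does the same job for ggplot2 plots.

```r
# Copy of the `Estimated percent body fat` column from the question's data
bf <- c(NA, 7.2, NA, NA, 24.1, 25.1, 30.2, 23.6, 24.3, 31.4, NA, 14.1,
        20.5, NA, 23.1, 30.6, 21, 20.9, NA, 24, 26.7, 16.6, NA, 26.9,
        16.9, 21.3, 15.9, 27.4, 13.9, NA, 20, NA, 12.8, NA, 33.8, 18.1,
        NA, NA, 28.4, 10.9, 38.1, 33, 39.3, 15.9, 32.7, NA, 20.4, 16.8,
        NA, 29)

out <- file.path(tempdir(), "density.pdf")  # hypothetical output path

pdf(out, width = 8, height = 5)             # size in inches, fixed up front
plot(density(bf, na.rm = TRUE), main = "Density curve")
abline(v = bf[12], col = "red")
text(bf[12], par("usr")[4] * 0.95, "custom", pos = 4, cex = .8)
dev.off()                                   # close the device to write the file
```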

Data

dataset <- structure(list(`Respondent sequence number` = c(21029L, 21034L, 
21043L, 21056L, 21067L, 21085L, 21087L, 21105L, 21107L, 21109L, 
21110L, 21125L, 21129L, 21138L, 21141L, 21154L, 21193L, 21195L, 
21206L, 21215L, 21219L, 21221L, 21232L, 21239L, 21242L, 21247L, 
21256L, 21258L, 21287L, 21310L, 21325L, 21367L, 21380L, 21385L, 
21413L, 21418L, 21420L, 21423L, 21427L, 21432L, 21437L, 21441L, 
21444L, 21453L, 21466L, 21467L, 21477L, 21491L, 21494L, 21495L
), `Estimated percent body fat` = c(NA, 7.2, NA, NA, 24.1, 25.1, 
30.2, 23.6, 24.3, 31.4, NA, 14.1, 20.5, NA, 23.1, 30.6, 21, 20.9, 
NA, 24, 26.7, 16.6, NA, 26.9, 16.9, 21.3, 15.9, 27.4, 13.9, NA, 
20, NA, 12.8, NA, 33.8, 18.1, NA, NA, 28.4, 10.9, 38.1, 33, 39.3, 
15.9, 32.7, NA, 20.4, 16.8, NA, 29)), row.names = c(NA, 50L), class = 
"data.frame")

Thank you @jay.sf, and very good remark on storing the quantiles to use them as coordinates later; I will do it as soon as the label code is working. Your example works, but I would like the positioning of the labels to be dynamic. I want to use the same code on different datasets (the datasets are dynamically generated), and in other datasets the right y position might be 0.07, or 0.03, etc. — Luca, Mar 30 '19 at 08:46
Unfortunately I don't have dynamic positioning in my base solution. The dynamic positioning is the part I really want to add — Luca, Mar 30 '19 at 08:48
In e.g. `text(c1 + 4...`, `c1` dynamically depends on your custom value. I consider this as dynamic. The `+ 4` depends on label length which always stays the same. — jay.sf, Mar 30 '19 at 08:51
Good point. I think the main issue is the y positioning: while now the maximum in the y axis is 0.05, it might be 0.1 in a different dataset. Regarding the x axis, I agree that your solution is dynamic, however with a different dataset (example: maximum is 30 instead of 40), then the +4 would become a bit too much or a bit too little. — Luca, Mar 30 '19 at 09:00
For x positioning, I imagine I can calculate the maximum for my x axis with a formula from my dataset and then, based on that, estimate the spacing for the label. For y positioning, I tried to look for a formula that gives me the maximum but I couldn't find one. Thank you @jay.sf for your help! — Luca, Mar 30 '19 at 09:03
For the y axis you could use the maximum of your density distribution, e.g. `max(density(dat$estimated.percent.body.fat, na.rm=TRUE)$y)`. — jay.sf, Mar 30 '19 at 09:07
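
Putting the comments above together for ggplot2, a rough sketch could look like the following. It assumes that the peak from density() is close enough to the curve drawn by geom_density() (both use the same default bandwidth rule, but this is an approximation, not a guarantee), and it uses hjust in annotate() to put each label on the desired side of its line. The data frame `df`, the column name `body_fat` and the variable `d_max` are made up for this illustration.

```r
library(ggplot2)

# Copy of the `Estimated percent body fat` column from the question's data
bf <- c(NA, 7.2, NA, NA, 24.1, 25.1, 30.2, 23.6, 24.3, 31.4, NA, 14.1,
        20.5, NA, 23.1, 30.6, 21, 20.9, NA, 24, 26.7, 16.6, NA, 26.9,
        16.9, 21.3, 15.9, 27.4, 13.9, NA, 20, NA, 12.8, NA, 33.8, 18.1,
        NA, NA, 28.4, 10.9, 38.1, 33, 39.3, 15.9, 32.7, NA, 20.4, 16.8,
        NA, 29)
df <- data.frame(body_fat = bf)              # hypothetical column name

d_max <- max(density(bf, na.rm = TRUE)$y)    # approximate top of the curve
c1    <- bf[12]                              # custom value
q     <- quantile(bf, c(.05, .95), na.rm = TRUE, names = FALSE)

p <- ggplot(df, aes(x = body_fat)) +
  geom_density(na.rm = TRUE) +
  geom_vline(xintercept = c1, color = "red") +
  geom_vline(xintercept = q, color = "grey") +
  # hjust = 0 starts the text at x (label right of the line),
  # hjust = 1 ends the text at x (label left of the line);
  # the spaces in the labels add a small gap next to the line
  annotate("text", x = c1, y = d_max, label = " custom",
           hjust = 0, colour = "red") +
  annotate("text", x = q[1], y = d_max / 2, label = "5% percentile ",
           hjust = 1) +
  annotate("text", x = q[2], y = d_max / 2, label = " 95% percentile",
           hjust = 0)
p
```

Since no x or y value is typed in by hand, the labels should follow the lines and the height of the curve when the data change.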
