4

I am trying to label outliers with ggplot. Regarding my code, I have two questions:

  1. Why does it not label outliers below 1.5*IQR?

  2. Why does it not label outliers based on the group they are in but instead apparently refers to the overall mean of the data? I would like to label outliers for each box plot individually. I.e. the outliers for Country A in Wave 1 (of a survey), etc.

A sample of my code:

PERCENT <- rnorm(50, sd = 3)
WAVE <- sample(6, 50, replace = TRUE)
AGE_GROUP <- rep(c("21-30", "31-40", "41-50", "51-60", "61-70"), 10)
COUNTRY <- rep(c("Country A", "Country B"), 25)
N <- rnorm(50, mean = 200, sd = 2)

df <- data.frame(PERCENT, WAVE, AGE_GROUP, COUNTRY, N)

ggplot(df, aes(x = factor(WAVE), y = PERCENT, fill = factor(COUNTRY))) +
  geom_boxplot(alpha = 0.3) +
  geom_point(aes(color = AGE_GROUP, group = factor(COUNTRY)), position = position_dodge(width=0.75)) +
  geom_text(aes(label = ifelse(PERCENT > 1.5*IQR(PERCENT)|PERCENT < -1.5*IQR(PERCENT), paste(AGE_GROUP, ",", round(PERCENT, 1), "%, n =", round(N, 0)),'')), hjust = -.3, size = 3)

A picture of what I have so far: Outlier Label

I appreciate your help!

Claus Wilke
Tea Tree
  • I wonder if [**this question**](https://stackoverflow.com/questions/33524669/labeling-outliers-of-boxplots-in-r) is helpful for you. – jazzurro Dec 16 '17 at 05:39
  • Thanks! I saw this when googling the problem. I was hoping to solve the problem without adding an additional column to my data frame. – Tea Tree Dec 16 '17 at 06:08
  • [This](https://stackoverflow.com/questions/47740448/cannot-place-count-label-at-boxplot-whisker-with-outliers-present/47741716#47741716) may also be helpful. – Claus Wilke Dec 16 '17 at 06:29
  • And [this](https://stackoverflow.com/questions/47777159/ggplot2-geom-boxplot-annotating-counts-without-computing-them-in-advance/47777280#47777280) regarding your second question, about grouping. – Claus Wilke Dec 16 '17 at 06:31
  • Apparently my first and second question go together. And it seems like what I have been trying to do can only be done by adding another column to the data frame because of the way ggplot works. I'll give this a try! Thanks for pointing me to these resources! – Tea Tree Dec 16 '17 at 06:39
  • Which version of ggplot2 are you running? I cannot reproduce your figure. – Claus Wilke Dec 16 '17 at 17:06
  • I use the latest version (2.2.1) and I have loaded the following libraries: haven, readr, ggplot2, dplyr, stringr – Tea Tree Dec 17 '17 at 00:53
  • I solved the problem using [JasonAizkalns's](https://stackoverflow.com/questions/33524669/labeling-outliers-of-boxplots-in-r) function. Even though I had to add an extra column to my data, it was the most convenient way for me to do it. I appreciate your help. – Tea Tree Dec 17 '17 at 00:58

2 Answers

3

If you want the IQR to be calculated by country, you need to group the data. You could probably do it globally (i.e. before you send the data to ggplot) or locally in the layer.

library(dplyr)
library(ggplot2)

ggplot(df, aes(x = as.factor(WAVE), y = PERCENT, fill = COUNTRY)) +
  geom_boxplot(alpha = 0.3) +
  geom_point(aes(color = AGE_GROUP, group = COUNTRY), position = position_dodge(width=0.75)) +
  geom_text(aes(group = COUNTRY, label = ifelse(!between(PERCENT,-1.3*IQR(PERCENT), 1.3*IQR(PERCENT)), 
                                                paste(" ",COUNTRY, ",", AGE_GROUP, ",", round(PERCENT, 1), "%, n =", round(N, 0)),'')), 
            position = position_dodge(width=0.75),
            hjust = "left", size = 3)
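
The "global" route mentioned above (preparing the data before it reaches ggplot) could look roughly like the sketch below. It is not part of the original answer. The columns `lower`, `upper` and `OUTLIER_LABEL` are hypothetical names introduced here, the seed is only there for reproducibility, and the quartile-based fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) are assumed to match what `geom_boxplot()` draws by default.

```r
library(dplyr)
library(ggplot2)

set.seed(42)  # assumption: added only so the sketch is reproducible

df <- data.frame(
  PERCENT   = rnorm(50, sd = 3),
  WAVE      = sample(6, 50, replace = TRUE),
  AGE_GROUP = rep(c("21-30", "31-40", "41-50", "51-60", "61-70"), 10),
  COUNTRY   = rep(c("Country A", "Country B"), 25),
  N         = rnorm(50, mean = 200, sd = 2)
)

# Compute the fences and the label text once per WAVE x COUNTRY group,
# i.e. once per box, before plotting.
df_labeled <- df %>%
  group_by(WAVE, COUNTRY) %>%
  mutate(
    lower = quantile(PERCENT, 0.25, names = FALSE) - 1.5 * IQR(PERCENT),
    upper = quantile(PERCENT, 0.75, names = FALSE) + 1.5 * IQR(PERCENT),
    OUTLIER_LABEL = ifelse(
      PERCENT < lower | PERCENT > upper,
      paste0(AGE_GROUP, ", ", round(PERCENT, 1), "%, n = ", round(N, 0)),
      ""
    )
  ) %>%
  ungroup()

# The label column is then mapped directly; no statistics inside aes().
p <- ggplot(df_labeled, aes(x = factor(WAVE), y = PERCENT, fill = COUNTRY)) +
  geom_boxplot(alpha = 0.3) +
  geom_text(aes(label = OUTLIER_LABEL, group = COUNTRY),
            position = position_dodge(width = 0.75),
            hjust = -0.2, size = 3)
p
```

This costs extra columns in the data frame, but the grouping is explicit and does not depend on how ggplot evaluates expressions inside `aes()`.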
dmi3kno
  • The line `data=. %>% group_by(COUNTRY)` has no effect as far as I can tell, and I also wouldn't know why it should have one. The currently released ggplot2 version does not respect groupings from dplyr. – Claus Wilke Dec 16 '17 at 17:03
  • I see. `group=COUNTRY` in aesthetics is enough. I corrected the answer. – dmi3kno Dec 16 '17 at 18:11
  • Actually, as far as I can tell, even that is not needed. I'm not sure why the OP's result looks the way it does. – Claus Wilke Dec 16 '17 at 18:31
2

Adding the group aesthetic to geom_text and modifying the ifelse test should do what you want.

Setting group = interaction(WAVE, COUNTRY) will restrict the calculations to within each boxplot, and the outlier test needs to include a call to median(PERCENT).

library(ggplot2)
set.seed(42)

PERCENT   <- rnorm(50, sd = 3)
WAVE      <- sample(6, 50, replace = TRUE)
AGE_GROUP <- rep(c("21-30", "31-40", "41-50", "51-60", "61-70"), 10)
COUNTRY   <- rep(c("Country A", "Country B"), 25)
N         <- rnorm(50, mean = 200, sd = 2)

df <- data.frame(PERCENT, WAVE, AGE_GROUP, COUNTRY, N)

ggplot(df) +
  aes(x = factor(WAVE),
      y = PERCENT,
      fill = factor(COUNTRY)) +
  geom_boxplot(alpha = 0.3) +
  geom_point(aes(color = AGE_GROUP, group = factor(COUNTRY)), position = position_dodge(width=0.75)) + 

  geom_text(aes(group = interaction(WAVE, COUNTRY),
                label = ifelse(test = PERCENT > median(PERCENT) + 1.5*IQR(PERCENT)|PERCENT < median(PERCENT) -1.5*IQR(PERCENT),
                               yes  = paste(AGE_GROUP, ",", round(PERCENT, 1), "%, n =", round(N, 0)),
                               no   = '')),
            position = position_dodge(width = 0.75),
            hjust = -.2,
            size = 3)
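
As a side note on the outlier test used in the code above: fences centered on the median are narrower than fences measured from the quartiles, so the two tests can disagree about which points get a label. The sketch below compares them on a small made-up vector. `is_outlier_quartile()` and `is_outlier_median()` are hypothetical helper names, not functions from ggplot2 or from the answer, and the quartile rule is assumed to be the one a default boxplot uses to draw outlier points.

```r
# Quartile-based rule: a point is flagged if it lies more than
# coef * IQR below Q1 or above Q3.
is_outlier_quartile <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# Median-centered rule, as in the ifelse() test above.
is_outlier_median <- function(x, coef = 1.5) {
  x > median(x) + coef * IQR(x) | x < median(x) - coef * IQR(x)
}

x <- c(1, 2, 3, 4, 5, 8)  # made-up values for illustration

x[is_outlier_quartile(x)]  # nothing flagged: upper fence is 4.75 + 3.75 = 8.5
x[is_outlier_median(x)]    # 8 is flagged: upper fence is 3.5 + 3.75 = 7.25
```

If the labels should line up exactly with the points that the boxplot draws as outliers, the quartile-based helper is probably the safer test to apply within each group.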

Peter