
Any suggestions on how to solve this problem? Other similar questions here show how to label outliers when the boxplots are split by a single factor variable. My case is different: I would like to see the labels of the outliers for multiple variables.

For example, I have a boxplot of the standardized `mtcars` variables, created with this command:

library(magrittr)  # provides %>%
library(reshape2)  # provides melt()
library(ggplot2)

# Standardize every column, then keep the car names for labelling
z_mtcars <- data.frame(scale(mtcars[-12]))
z_mtcars$type <- rownames(mtcars)

z_mtcars %>%
  melt(id.vars = "type") %>%
  ggplot() +
  aes(x = variable, y = value, fill = as.numeric(variable)) +
  geom_boxplot() +
  scale_fill_distiller(palette = "Blues") +
  scale_alpha(range = c(1, 1)) +
  ggtitle("Boxplot: Standardized Score (Z-Scale)") +
  xlab("Variables") +
  ylab("Value") +
  labs(fill = "Order of \nVariables") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 1, linetype = "dotted", color = "blue") +
  theme(legend.position = "left")
tjebo
cnauber
  • Hi, welcome to SO, not sure how your case is different as you don't say, but the goal is to label the boxplot outlier. How would you want it to be labeled? As it sits, the point says 'and there are these', so via label what do you want to say about them? – Chris Jul 23 '20 at 02:57
  • This same question had been marked as a duplicate yesterday - I am not quite sure how this is different - could you kindly elaborate? Also please kindly consider not deleting questions that are marked as duplicates in the future, because they are generally quite helpful for others to find answers to similar questions because it increases "visibility" for search engines – tjebo Jul 23 '20 at 06:44

2 Answers


Here is what I tried. I simplified your code a bit to focus on the labelling. The idea is to flag the outliers within each variable using the borrowed function below, store the car name in a new column called outlier (NA for everything that is not an outlier), and pass that column as the label to geom_text_repel() from the ggrepel package.

library(tidyverse)
library(ggrepel)

z_mtcars <- data.frame(scale(mtcars[-12]))
z_mtcars$type <- rownames(mtcars)

I borrowed this function from this question. Credit goes to JasonAizkalns.

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
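
As a quick check of what the helper flags, it can be applied to a single column first. This is only a small sketch on the unscaled `wt` column of `mtcars`; scaling should not change which cars are flagged, because the fences are built from quartiles and the IQR. The name `wt_outliers` is just for illustration.

```r
# Same helper as above, repeated so the snippet runs on its own.
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Names of the cars whose weight falls outside the 1.5 * IQR fences.
wt_outliers <- rownames(mtcars)[is_outlier(mtcars$wt)]
print(wt_outliers)
# [1] "Cadillac Fleetwood"  "Lincoln Continental" "Chrysler Imperial"
```

These three cars are the ones that should end up labelled above the `wt` box in the plot.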

z_mtcars %>%
  pivot_longer(names_to = "variable", values_to = "value", -type) %>%
  group_by(variable) %>%
  # Car name for outliers within each variable, NA otherwise
  mutate(outlier = if_else(is_outlier(value), type, NA_character_)) %>%
  ggplot(aes(x = variable, y = value, color = variable)) +
  geom_boxplot() +
  geom_text_repel(aes(label = outlier), na.rm = TRUE, show.legend = FALSE)


jazzurro

In the code below, we use geom_text to add labels to the outliers. Within geom_text, we calculate the outlier locations and filter the data down to the outliers. One odd thing is that I had to set coef to 1.4 in boxplot.stats (instead of the default 1.5) in order to get all of the outliers included. Not sure why. I've also switched to pivot_longer from tidyr, since reshape2 is outdated, and kept only the relevant portions of code.

library(tidyverse)
  
z_mtcars <- data.frame(scale(mtcars[-12]))
z_mtcars$type <- rownames(mtcars)

z_mtcars %>% 
  pivot_longer(cols=-type) %>% 
  ggplot(aes(x = name, y = value)) +
    geom_boxplot() +
    geom_text(data=. %>% 
                group_by(name) %>%  
                filter(value %in% boxplot.stats(value, coef=1.4)$out),
              aes(label=type, y=value), nudge_x=0.1, colour="red", size=3, hjust=0) +
    theme_classic() +
    expand_limits(x=12.6)
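
A variant sketch that avoids tuning `coef`: filter the label data with the same quantile-based fences as the `is_outlier` helper from the other answer. Whether these fences match what `geom_boxplot` draws for every variable is an assumption worth checking on your own data. The names `long`, `labels` and `p` are only for illustration, and `scale(mtcars)` is used here because `mtcars` has 11 columns.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Quantile-based 1.5 * IQR rule (same helper as in the other answer).
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

z_mtcars <- data.frame(scale(mtcars))
z_mtcars$type <- rownames(mtcars)

long <- pivot_longer(z_mtcars, cols = -type)

# Rows to label: flagged separately within each variable.
labels <- long %>%
  group_by(name) %>%
  filter(is_outlier(value)) %>%
  ungroup()

p <- ggplot(long, aes(x = name, y = value)) +
  geom_boxplot() +
  geom_text(data = labels, aes(label = type),
            nudge_x = 0.1, colour = "red", size = 3, hjust = 0) +
  theme_classic() +
  expand_limits(x = 12.6)
print(p)
```

Building the label data as a separate data frame also makes it easy to inspect which cars will be labelled before plotting.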


eipi10
  • Thanks. It must be something related to how `boxplot.stats` calculates outliers. For example, `is_outlier(z_mtcars$wt)` flags three rows (15: 2.077504765, 16: 2.255335698, 17: 2.174596366), while `boxplot.stats(z_mtcars$wt, coef = 1.5)` flags only two (16: 2.255335698, 17: 2.174596366). – cnauber Jul 23 '20 at 21:12
  • Yes, it's related to `boxplot.stats`, but `geom_boxplot` is supposedly using `boxplot.stats` with `coef=1.5` to determine outliers. I'm therefore not sure why I get fewer outliers with (ostensibly) the same calculation outside of geom_boxplot. – eipi10 Jul 23 '20 at 21:21
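
One likely explanation for the discrepancy discussed in these comments, offered as a sketch rather than a confirmed diagnosis: `boxplot.stats()` takes its hinges from `fivenum()`, while `quantile()` with its default type 7 interpolates differently (and, as far as I can tell, ggplot2's boxplot stat computes its quartiles with `quantile()` as well). With 32 observations the two definitions give slightly different quartiles, and therefore slightly different fences. The comparison below uses the unscaled `mtcars$wt`; the variable names are only for illustration.

```r
x <- mtcars$wt

# Hinges as used by boxplot.stats() (Tukey's five-number summary).
hinges <- fivenum(x)[c(2, 4)]
# Quartiles as returned by quantile() with the default type = 7.
quartiles <- unname(quantile(x, c(0.25, 0.75)))

# Upper fences under each definition, both with coef = 1.5.
upper_hinge_fence <- hinges[2] + 1.5 * diff(hinges)
upper_quantile_fence <- quartiles[2] + 1.5 * diff(quartiles)
lower_quantile_fence <- quartiles[1] - 1.5 * diff(quartiles)

# Points flagged by each rule.
out_hinges <- boxplot.stats(x, coef = 1.5)$out
out_quantile <- x[x < lower_quantile_fence | x > upper_quantile_fence]

print(c(hinge_fence = upper_hinge_fence, quantile_fence = upper_quantile_fence))
print(list(boxplot.stats = out_hinges, quantile_rule = out_quantile))
```

For `wt`, the quantile-based upper fence is lower than the hinge-based one, so the 5.25 value is flagged by the quantile rule but not by `boxplot.stats(x, coef = 1.5)`. That matches the three-versus-two counts reported above and would explain why lowering `coef` to 1.4 recovers the missing label.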