Data https://drive.google.com/file/d/1YuhqzBbQfdJx9MWYmc2nrlgOO-IyARoK/view?usp=sharing

How can I label the outliers in my data? I would like to know which sites are the outliers. Here is my code so far. Thanks.

# without jitter
ggplot(data = df, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = "Environmental Parameters", y = "Standardized Range") +
  theme(legend.position = "none") +
  theme(text = element_text(family = "Times New Roman", face = "bold", size = 12))

# with jitter
ggplot(data = df, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = "Environmental Parameters", y = "Standardized Range") +
  theme(legend.position = "none") +
  theme(text = element_text(family = "Times New Roman", face = "bold", size = 12)) +
  geom_jitter(position = position_jitter(0.1))
zx8754
  • It's a lot better if you can just include a sample of data in the post itself so folks don't have to download from a third-party site; [see here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for some methods. You have to figure out how to define outliers before you can mark them. How do you want to go about that? – camille Nov 19 '19 at 13:56
  • Hi @Joerick, your code is not reproducible because there is no variable called 'variable' in your sample. To make things easier, could you provide a reproducible piece of code? – claudius Nov 19 '19 at 14:02
  • Apologies. I had to run df <- melt(StanEnvCCA) before plotting. – Joerick Paderogao Nov 19 '19 at 14:09

2 Answers

1

As suggested by @jtr13 in this answer [1], to make the outliers explicit in the boxplot, extract the list of outlier values with the ggplot_build function, convert that list into a tibble with the map_df function, and pass the tibble to geom_text to highlight the outliers.
The boxplot below shows the outliers highlighted in red.

[Figure: boxplot with the outliers highlighted in red]


# load packages
library(tidyverse)
library(reshape)

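# read data

# set path to the folder that contains StanEnvCCA.csv
path <- '/'
file_path <- paste0(path, '/StanEnvCCA.csv')

StanEnvCCA <- 
  read.csv(file_path, 
           header = TRUE,
           sep = ';',
           dec = '.') 

# transform to long format (columns: variable, value)
df <- melt(StanEnvCCA) 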


# calculate boxplot object
g <- ggplot(data=df, aes(x=variable, y=value, fill=variable)) + 
  geom_boxplot() + 
  theme_bw() + 
  labs(x="Environmental Parameters", y="Standardized Range")+
  theme(legend.position = "none") +  
  theme(text=element_text(family="Times New Roman", face="bold", size=12)) + 
  geom_jitter(position=position_jitter(0.1))

# get list of outliers 
out <- ggplot_build(g)[["data"]][[1]][["outliers"]]

# label list elements with factor levels
names(out) <- levels(factor(df$variable))

# convert to tidy data
tidyout <- purrr::map_df(out, tibble::as_tibble, .id = "variable")

# plot boxplots with labels
g + geom_text(data = tidyout, aes(variable, value, label = variable), 
              hjust = -.3, colour='red')
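
The labels above show the variable name of each outlier. Since the question asks which sites are the outliers, here is a minimal sketch of one way to label them by site instead. It assumes the long-format data keeps a site identifier; the `site` column, the toy values, and the helper `is_outlier()` are all hypothetical and not taken from the original file. The helper flags values more than 1.5 * IQR beyond the quartiles, which is the default rule geom_boxplot() uses to draw outlier points.

```r
library(ggplot2)

# Toy long-format data; `site` is a hypothetical identifier column.
df_long <- data.frame(
  site     = rep(paste0("S", 1:8), times = 2),
  variable = rep(c("pH", "Temp"), each = 8),
  value    = c(0.1, 0.2, 0.0, -0.1, 0.3, 0.2, 0.1, 3.0,
               -0.2, 0.0, 0.1, 0.2, -0.1, 0.1, -2.5, 0.0)
)

# Hypothetical helper: TRUE for values beyond 1.5 * IQR from the quartiles.
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- diff(q)
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# Apply the rule within each variable; ave() returns 0/1, so convert back.
df_long$outlier <- as.logical(ave(df_long$value, df_long$variable, FUN = is_outlier))

# Draw the boxplots and label only the flagged rows with their site name.
p <- ggplot(df_long, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  geom_text(data = subset(df_long, outlier),
            aes(label = site), hjust = -0.3, colour = "red") +
  theme_bw() +
  theme(legend.position = "none")
p
```

Because the flag is computed on the data rather than read back from ggplot_build(), every outlier row still carries its site name when it reaches geom_text().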

claudius
-1

Save the file locally and load it. I used file.choose() just to speed things up.

library(openxlsx)  # assumption: read.xlsx() below is the one from openxlsx
library(ggplot2)

filename <- file.choose()
bd <- read.xlsx(filename)

reshape to long format so that each value carries its variable name (stack() returns the columns values and ind)

bd<-data.frame(bd[0:0], stack(bd[2:ncol(bd)]))

make the plot

g <- ggplot(data = bd, aes(x = ind, y = values)) + geom_boxplot() + theme_bw()

extract the outliers from the plot

out <- ggplot_build(g)[["data"]][[1]][["outliers"]]

label the list

names(out) <- levels(factor(bd$ind))

tidy the data

tidyout <- purrr::map_df(out, tibble::as_tibble, .id = "ind")

plot your boxplots

g + geom_text(data = tidyout, aes(ind, value, label = value), 
              hjust = -.3)

This is an adaptation of jtr13's answer from the post "Labeling Outliers of Boxplots in R".
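
If a table of outlying sites is enough, the plot can be skipped. The following is a rough sketch in base R, assuming the wide data has one identifier column followed by numeric parameter columns; the `Site` column, the parameter names, and the values are made up for illustration. It relies on boxplot.stats(), whose hinges can differ slightly from the quartiles ggplot2 uses, so borderline points may be classified differently.

```r
# Toy wide-format data; `Site` and the parameter columns are hypothetical.
env <- data.frame(
  Site = paste0("S", 1:8),
  pH   = c(0.1, 0.2, 0.0, -0.1, 0.3, 0.2, 0.1, 3.0),
  Temp = c(-0.2, 0.0, 0.1, 0.2, -0.1, 0.1, -2.5, 0.0)
)

# Every column except the identifier is treated as a parameter.
params <- setdiff(names(env), "Site")

# For each parameter, keep the rows whose value boxplot.stats() reports as an outlier.
outlier_sites <- do.call(rbind, lapply(params, function(p) {
  hit <- env[[p]] %in% boxplot.stats(env[[p]])$out
  data.frame(variable = rep(p, sum(hit)),
             Site     = env$Site[hit],
             value    = env[[p]][hit])
}))

print(outlier_sites)
```

Each row of the result pairs a parameter with a site whose value falls outside the whiskers for that parameter.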

Hope it helps.

Virgil Ion
  • I already have the function. I am having trouble in the succeeding lines because of the difference in the data structure I have from the abovementioned post. There are some arguments that I cannot use in mine. – Joerick Paderogao Nov 19 '19 at 13:54
  • Please only use the answers section for complete answers, not just links to previous posts. If you think this is a duplicate, it can be flagged as such – camille Nov 19 '19 at 13:54