
How can I modify the shape of the outliers in geom_boxplot so that it matches the sample ID over time? Imagine I have this kind of data (this is just dummy data; the code might not be pretty, but it's what I came up with):

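# load the packages used below (%>%, group_by, mutate, case_when, ggplot)
library(dplyr)
library(ggplot2)

# create dummy data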
df <- data.frame()
set.seed(42)
os <- 0
sam <- 1
for (time in as.factor(c('T0', 'T1'))) {
  if (time == 'T1') {
    sam <- 1
  }
  for (group in as.factor(c('A','B'))) {
    for (pat in 1:10) {
      df[pat + os, 'Sample'] <- paste('P', pat, '_', sam, sep = '')
      df[pat + os, 'Time'] <- time
      df[pat + os, 'Group'] <- group
      df[pat + os, 'Value'] <- rnorm(1) + os
      # add outlier, they are the same in each group in this example,
      # but can differ in the real data set
      if (pat == 2 | pat == 9) {
        print(pat)
        df[pat + os, 'Value'] <- df[pat + os, 'Value'] + 10
      }
      sam <- sam + 1
    }
    os <- os + 10
  }
}

# mark outliers in table
df = df %>% 
  group_by(Group,Time)  %>%
  mutate(is_outlier = case_when(Value > quantile(Value)[4] + 1.5*IQR(Value) ~ TRUE,
                                Value < quantile(Value)[2] - 1.5*IQR(Value) ~ TRUE,
                                TRUE ~ FALSE))

This results in the following plot:

ggplot(df, aes(x = Time,
               y = Value,
               label = Time)) +
  geom_boxplot(outlier.colour =  'red',
               outlier.shape = 1,
               outlier.size = 2
  ) +
  facet_grid(~factor(Group),
             switch = 'x',
             scales = 'free_y')

[Figure: test plot with outliers]

Goal:

What I want is to be able to see, for each group A or B, whether the outliers are the same samples. For instance, the outlier shown in A T0 should be recognizable as the same sample in A T1. More specifically, the outlier drawn as a circle in A T0 should also be a circle in A T1, and the second outlier in A T1 should get a different shape (e.g. a triangle). Since my original data has about 5 or 6 time points, it would be nice to see from the plot whether an outlier stays an outlier. In some cases my original dataset has about 5-8 outliers.

In group B we can reuse the same shapes as in group A, although its sample IDs differ from those in group A.

I want to use basic shapes like triangles, circles, asterisks and so on (I know the number of shapes is limited, but for my kind of dataset it should suffice). I also know that I can label the data points, but I don't want that. Different colours would be okay too, but I'd prefer different shapes.

I guess I have to calculate outliers separately and then maybe use geom_point with aes(shape = df$Sample) or something. But I can't figure it out.

Does anybody have a hint or a solution based on my dummy data? That would be awesome :-)

Best TMC

  • 1
    The answer of Heroke here will do this for you. https://stackoverflow.com/questions/33524669/labeling-outliers-of-boxplots-in-r – Roman Mar 24 '22 at 13:31
  • @Roman as far as I can see, it only adds a label to the outlier. Now, how do I use that for the *shape* of the outlier? – TMC Mar 24 '22 at 14:16
  • 1
    Adapt the other answer to use `geom_point` instead of drawing a label. In the original plot set the `outlier.shape = NA` as suggested by the `geom_boxplot` documentation for such situations. – AEF Mar 24 '22 at 14:23
  • I'm sorry, I don't get it. If I add `geom_point(aes(shape=label))` it draws every data point, and only the outliers are shaped differently. If I change `dat[,label:=ifelse(outlier,rawDF$Sample,NA)]` the outliers still all have the same shape. Not to mention that I have Group A and B and T0 and T1. – TMC Mar 24 '22 at 15:04
  • I changed the original post to add code that calculates whether a value is an outlier, since I can't get the formula suggested by Roman to work. The question remains how I can change the shape of the outliers so that we can follow whether an outlier in T0 is still an outlier in T1 without adding a label. – TMC Mar 24 '22 at 15:33

1 Answer


I figured out a really ugly solution. I'm pretty sure there is a prettier way to do this but here is the full code:

First we create dummy data:

# start with a clean environment
rm(list=ls())  
# create a function to load or install all necessary libraries 
install.load.package <- function(x) {
  if (!require(x, character.only = TRUE))
    install.packages(x)
  require(x, character.only = TRUE)
}
package_vec <- c("ggplot2",
                 "dplyr"
)
sapply(package_vec, install.load.package)  

# now to the data
df <- data.frame()
set.seed(42)
os <- 0
sam <- 1
for (time in as.factor(c('T0', 'T1'))) {
  if (time == 'T1') {
    sam <- 1
  }
  for (group in as.factor(c('A','B'))) {
    for (pat in 1:10) {
      df[pat + os, 'Sample'] <- paste('P', pat, '_', sam, sep = '')
      df[pat + os, 'Time'] <- time
      df[pat + os, 'Group'] <- group
      df[pat + os, 'Value'] <- rnorm(1) + os
      # add outlier, they are the same in each group in this example,
      # but can differ in the real data set
      if (pat == 2 | pat == 9) {
        print(pat)
        df[pat + os, 'Value'] <- df[pat + os, 'Value'] + 10
      }
      sam <- sam + 1
    }
    os <- os + 10
  }
}

Then we calculate the outliers as follows and create a new column that holds the sample ID of each outlier. If a value is not an outlier, an 'X' is inserted instead:

# calculate outliers
df = df %>% 
  group_by(Group,Time)  %>%
  mutate(is_outlier = case_when(Value > quantile(Value)[4] + 1.5*IQR(Value) ~ as.character(Sample),
                                Value < quantile(Value)[2] - 1.5*IQR(Value) ~ as.character(Sample),
                                TRUE ~ as.character('X')))
df$Group <- as.factor(df$Group)
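
As a quick sanity check of the rule used above, here is a minimal base-R sketch on a made-up vector. The helper name `tukey_outlier` and the toy values are assumptions for illustration and are not part of the original code. It flags anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, using the default `quantile()` definition.

```r
# Hypothetical helper (not from the original post): TRUE for values outside
# [Q1 - coef * IQR, Q3 + coef * IQR]
tukey_outlier <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# Made-up values: nine unremarkable points and one far away
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
flags <- tukey_outlier(values)

# Only the value 30 lies beyond the upper fence (7.75 + 1.5 * 4.5 = 14.5)
values[flags]
```

Applied per Group and Time, a helper like this should give the same TRUE/FALSE pattern as the two `case_when` conditions above.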

Now we replace the sample ID with a number. Within each group, the first outlier sample gets the number 1 (at every time point where it is an outlier), the second gets a 2, and so on. If there are more outliers than available `geom_point` shapes, the code has to be adapted. But let's just assume we don't have more than 25 outliers per group (the numeric point shapes run from 0 to 25, and the numbering here starts at 1).

for (group in levels(df$Group)) {
  count <- 1
  for (id in levels(as.factor(df$is_outlier[which(df$Group == group)]))) {
    if (id == 'X') {
      df[which(df$is_outlier == id), 'is_outlier'] <- as.character(NA)
    } else {
      df[which(df$is_outlier == id), 'is_outlier'] <- as.character(count)
      count <- count + 1
    }
  }
}

This overwrites the previously created column and introduces NAs for the 'X' values.
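
The same numbering idea can be sketched without string placeholders, assuming a logical outlier flag is available. The data frame `toy`, its `outlier` column and the `shape_id` column are made up for this illustration. Unlike the loop above, which numbers the IDs in alphabetical order of the factor levels, this sketch numbers them in order of first appearance.

```r
# Made-up toy data: one row per sample and time point
toy <- data.frame(
  Group   = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Time    = c("T0", "T0", "T1", "T1", "T0", "T0", "T1", "T1"),
  Sample  = c("P2_2", "P3_3", "P2_2", "P9_9",
              "P2_12", "P3_13", "P2_12", "P3_13"),
  outlier = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Number the outlier samples within each group; non-outliers stay NA
toy$shape_id <- NA_integer_
for (g in unique(toy$Group)) {
  rows <- toy$Group == g & toy$outlier
  ids  <- unique(toy$Sample[rows])
  toy$shape_id[rows] <- match(toy$Sample[rows], ids)
}

toy
```

Here P2_2 gets 1 at both time points and P9_9 gets 2, while group B starts again at 1.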

Now we can plot the data:

  ggplot(df, aes(x = Time,
              y = Value,
              label = Time)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = df,
             shape= as.numeric(df$is_outlier),
             color = 'red') +
  facet_grid(~factor(Group),
             switch = 'x',
             scales = 'free_y')

This results in this plot:

[Figure: outlier with identity shape]

Now we can see whether an outlier stays an outlier from T0 to T1. Be aware that Group B reuses the same shapes, although these are completely different samples. To give every sample its own shape across groups, the numbering code above the plotting code has to be adapted, but that leaves potentially fewer shapes available per group.
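
If the shapes should be unique across groups, one option is to number the outlier samples once over the whole table instead of once per group. This is a minimal sketch under that assumption; `toy` and `shape_id` are made-up names, and the rows stand for samples that are already known to be outliers.

```r
# Made-up outlier samples from two groups
toy <- data.frame(
  Group  = c("A", "A", "B", "B"),
  Sample = c("P2_2", "P9_9", "P2_12", "P9_19")
)

# One running number per distinct sample ID, regardless of group
toy$shape_id <- match(toy$Sample, unique(toy$Sample))

toy$shape_id
```

With this numbering, group B continues at 3 and 4 instead of restarting at 1.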

If one of you has a smoother and more elegant solution, I'd be happy to learn.
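
One possibly smoother variant of the plotting step, as a sketch only: pass just the outlier rows to `geom_point`, map the shape inside `aes()`, and pick the shapes with `scale_shape_manual`. This avoids handing a full-length vector with NAs to `shape` and adds a legend. The toy data and the hard-coded `shape_id` column are assumptions for illustration. In the real data the IDs would come from the numbering step above.

```r
library(ggplot2)

# Made-up toy data: two groups, two time points, ten samples per box
set.seed(1)
toy <- data.frame(
  Group = rep(c("A", "B"), each = 20),
  Time  = rep(rep(c("T0", "T1"), each = 10), times = 2),
  Value = rnorm(40)
)

# Shift two samples per box far away and give each a hypothetical shape ID
first  <- c(2, 12, 22, 32)
second <- c(9, 19, 29, 39)
toy$Value[c(first, second)] <- toy$Value[c(first, second)] + 10
toy$shape_id <- NA_character_
toy$shape_id[first]  <- "1"
toy$shape_id[second] <- "2"

p <- ggplot(toy, aes(x = Time, y = Value)) +
  # hide the default outlier points drawn by the boxplot
  geom_boxplot(outlier.shape = NA) +
  # draw only the marked rows, with the shape taken from shape_id
  geom_point(data = toy[!is.na(toy$shape_id), ],
             aes(shape = shape_id),
             colour = "red", size = 2) +
  scale_shape_manual(values = c("1" = 1, "2" = 2), name = "Outlier") +
  facet_grid(~ Group, switch = "x")
p
```

The named vector in `scale_shape_manual` would need one entry per outlier ID, so it grows with the number of outliers in the real data.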

Best TMC
