In our datasets, we have a few absolutely huge outliers. If we plot the data (e.g. in a boxplot) and include the outliers, the axis is squeezed so much that the plot is useless. Log-scaling doesn't help. But we want to tell the reader that the outliers exist (and say how many, and on which side of the boxplot, positive or negative), preferably without adding text manually to the caption. Is there a good method for this? Preferably in R, Matplotlib or Seaborn.

This is different from e.g. "Ignore outliers in ggplot2 boxplot" because I don't want to ignore the outliers: I want to show that they exist, but not plot them.

Sample code:

# from https://stackoverflow.com/questions/5677885/ignore-outliers-in-ggplot2-boxplot
> library("ggplot2")
> df = data.frame(y = c(-100, rnorm(100), 100))
> ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))

We see a boxplot that is useless because of the presence of outliers. If we follow the accepted answer at that link, we remove the outliers in a very nice way, but now the reader doesn't realise there were any outliers.

EDIT: A couple of comments/answers ask what I actually want, but that is precisely the difficulty -- I know I want an automated graphical presentation of the outliers (together with the main data), but I don't know exactly what it should look like. I hope someone in the community knows some best practice for this situation. I don't need help writing code to find outliers or add text to plots.

jmmcd
  • Could you post some sample data and a bit of basic code to visualise the problem? – Thomas Kühn Mar 28 '19 at 10:54
  • Possible duplicate of [Ignore outliers in ggplot2 boxplot](https://stackoverflow.com/questions/5677885/ignore-outliers-in-ggplot2-boxplot) – Wimpel Mar 28 '19 at 10:54
  • @ThomasKühn, added code. – jmmcd Mar 28 '19 at 22:12
  • @Wimpel, not a duplicate as described in edit. – jmmcd Mar 28 '19 at 22:12
  • Do you have a preference for signalling outliers? E.g., were this a figure in a paper I'd just write about the outliers in the caption, so it's your turn, what do you want to do? – gboffi Nov 09 '19 at 22:43
  • @gboffi thanks see edit. I want automated graphical presentation but I don't know how it should look. – jmmcd Nov 10 '19 at 11:15
  • I don't think there is an established convention. An idea: draw all the outliers in, say, red and place a red arrow near the border to signal an out-of-bounds outlier, optionally placing its value beside the arrow, in fine red print – gboffi Nov 10 '19 at 13:43

1 Answer

The base R function boxplot.stats() is what you need. See its help page (?boxplot.stats) for details on how outliers are identified. Here's one way to find and report on the presence of outliers.

  set.seed(123) # make reproducible
  y <- c(rnorm(3, -100), rnorm(3, 100), rnorm(100, 1))
  y <- sample(y) # mix 'em up
  out <- boxplot.stats(y)$out # find outliers
  lo <- out[out < median(y)] # collect low
  hi <- out[out > median(y)] # collect high
  sel.lo <- which(y %in% lo) # collect positions of low
  sel.hi <- which(y %in% hi) # collect positions of high

# Report on what was found
  sprintf("%d low outliers and %d high outliers found",
    length(lo), length(hi))
# [1] "3 low outliers and 3 high outliers found"

You could replace the values identified by sel.lo and sel.hi with placeholders at a more reasonable distance for plotting purposes. Of course, changing the data and reapplying boxplot() would likely change the statistics, and with them which points count as outliers.
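One way to sketch the placeholder idea without changing the statistics is to compute them once from the original data and hand them to bxp(), which draws a boxplot from precomputed values. The names `whisk`, `pad`, `is.lo`, `is.hi` and `y.plot` below are mine, and the 15% offset is an arbitrary choice, so treat this as an illustration rather than an established convention.

```r
set.seed(123)                                  # same toy data as above
y <- sample(c(rnorm(3, -100), rnorm(3, 100), rnorm(100, 1)))

st    <- boxplot.stats(y)                      # statistics of the ORIGINAL data
whisk <- st$stats[c(1, 5)]                     # ends of the whiskers
pad   <- 0.15 * diff(whisk)                    # placeholder offset (arbitrary)

is.lo <- y < whisk[1]                          # points below the lower whisker
is.hi <- y > whisk[2]                          # points above the upper whisker

# Copy of the data with outliers moved to just beyond the whiskers
y.plot <- y
y.plot[is.lo] <- whisk[1] - pad
y.plot[is.hi] <- whisk[2] + pad

# Draw from the precomputed statistics so the box and whiskers are unchanged;
# only the outlier positions are replaced by placeholders
z <- list(stats = matrix(st$stats), n = st$n, conf = matrix(st$conf),
          out = y.plot[is.lo | is.hi],
          group = rep(1, sum(is.lo | is.hi)), names = "")
bxp(z, outcol = "red", outpch = 4, main = "Outliers at placeholder positions")
```

Note that all placeholders on one side land on the same spot, so they overplot, and the axis no longer reflects their true values. A count label or a note is still needed to tell the reader how many points there are.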

If it is important to preserve the original boxplot statistics while keeping the outliers from stretching the axis, set the plot limits from the values returned by boxplot.stats().

  whisk <- boxplot.stats(y)$stats[c(1, 5)] # ends of the whiskers
  ylim <- whisk + c(-0.05, 0.05) * diff(whisk) # pad both ends by 5% of the whisker span
  par(mfrow = c(1,2), las = 2, mar = c(1, 4, 3, 1))
  boxplot(y, main = "All data")
  boxplot(y, ylim = ylim, main = "Outliers ignored")

[Figure: boxplot examples, two panels titled "All data" and "Outliers ignored"]
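To show the reader that outliers exist without drawing them, the cropped plot can be annotated automatically, along the lines of the arrow idea suggested in the comments on the question. The following is only a sketch: the helper names `outlier_label` and `flag`, the red colour, and the padding factors are my own choices, not a standard.

```r
set.seed(123)                                  # same toy data as above
y <- sample(c(rnorm(3, -100), rnorm(3, 100), rnorm(100, 1)))

whisk <- boxplot.stats(y)$stats[c(1, 5)]       # ends of the whiskers
lo <- y[y < whisk[1]]                          # low outliers
hi <- y[y > whisk[2]]                          # high outliers

pad  <- 0.25 * diff(whisk)                     # room for the markers (arbitrary)
ylim <- whisk + c(-pad, pad)

# Hypothetical helper: build a label with the count and range of the outliers
outlier_label <- function(v, side) {
  if (length(v) == 0) return(NA_character_)
  sprintf("%d %s (%.3g to %.3g)", length(v), side, min(v), max(v))
}
lab.lo <- outlier_label(lo, "below")
lab.hi <- outlier_label(hi, "above")

boxplot(y, ylim = ylim, outline = FALSE, main = "Outliers flagged, not drawn")

# Hypothetical helper: arrow pointing off the plot, with the label beside it
flag <- function(label, from, to, col = "red") {
  if (is.na(label)) return(invisible(NULL))    # nothing to flag on this side
  arrows(1, from, 1, to, length = 0.08, col = col, lwd = 2)
  text(1.05, (from + to) / 2, label, col = col, cex = 0.8, adj = 0)
}
flag(lab.lo, whisk[1] - 0.3 * pad, ylim[1])
flag(lab.hi, whisk[2] + 0.3 * pad, ylim[2])
```

Because the labels are computed from the data, nothing has to be typed into the caption by hand, and a side with no outliers simply gets no arrow.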

David O
  • Well, "we want to tell the reader that the outliers exist (and say how many, and on which side of the boxplot, positive or negative), preferably without adding text manually to the caption". `sprintf()` tells the code author, not the plot reader. And of course this is by adding text manually to the plot. – jmmcd Nov 09 '19 at 21:39
  • I'm not sure what you mean by "manually" or how you want to convey this information to the reader. The information is in `lo`, `sel.lo`, `hi` and `sel.hi`. The `sprintf` code simply illustrates how one can extract it programmatically (and proves that it works!). The extracted information could be used to modify the data in the existing data frame. Alternatively, place it in another data.frame, which can be used to add another layer in ggplot through your favorite `geom_xxx` function. Perhaps someone can suggest something if you provide an example of how you hope to convey the information. – David O Nov 09 '19 at 22:02
  • thanks see edit. I want automated graphical presentation but I don't know how it should look. – jmmcd Nov 10 '19 at 11:16