I want to draw boxplots in R and add names to outliers. So far I found this solution.

The function there provides all the functionality I need, but it scrambles the labels. In the following example, it marks the outlier as "u" instead of "o":

library(plyr)
library(TeachingDemos)
source("http://www.r-statistics.com/wp-content/uploads/2011/01/boxplot-with-outlier-label-r.txt") # Load the function
set.seed(1500)
y <- rnorm(20)
x1 <- sample(letters[1:2], 20, replace = TRUE)
lab_y <- sample(letters, 20)
# plot a boxplot with interactions:
boxplot.with.outlier.label(y~x1, lab_y)

Do you know of any solution? The ggplot2 library is super nice, but provides no such functionality (as far as I know). My alternative is to use the text() function and extract the outlier information from the boxplot object. However, with that approach the labels may overlap.

Thanks a lot :-)

Federico Giorgi
  • Update: I brought this error to Tal Galili's attention, and within hours he posted an edited version of the script that no longer exhibits this problem. – Josh O'Brien Oct 31 '11 at 22:20

2 Answers

I took a look at this with debug(boxplot.with.outlier.label), and ... it turns out there's a bug in the function.

The error occurs on line 125, where the data.frame DATA is constructed from x, y and label_name.

By that point x and y have already been reordered, while the labels (your lab_y) have not. When the supplied value of x (your x1) isn't itself already in order, you'll get the kind of jumbling you experienced.
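
To illustrate the kind of mismatch described above, here is a minimal sketch with toy data. The vectors and names (x, y, lab, ord) are hypothetical and this is not the function's actual code; it only shows what happens when two columns are reordered and the labels are not.

```r
# Toy data (hypothetical, not taken from the script)
x   <- c("b", "a", "b", "a")
y   <- c(10, 1, 20, 2)
lab <- c("p", "q", "r", "s")   # lab[i] belongs to y[i]

ord <- order(x)                # index that sorts by group
x_sorted <- x[ord]
y_sorted <- y[ord]

# Labels left in their original order no longer line up with y_sorted
misaligned <- data.frame(x = x_sorted, y = y_sorted, lab = lab,
                         stringsAsFactors = FALSE)

# Applying the same index to the labels keeps each label with its point
aligned <- data.frame(x = x_sorted, y = y_sorted, lab = lab[ord],
                      stringsAsFactors = FALSE)
```

In misaligned, the point y = 1 carries the label "p" although it was originally labelled "q"; in aligned, every point keeps its own label.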

As an immediate fix, you can pre-order the data by x like this (or do something more elegant):

df <- data.frame(y, x1, lab_y, stringsAsFactors=FALSE)
df <- df[order(df$x1), ]
# Needed since lab_y is not searched for in data (though it probably should be)
lab_y <- df$lab_y  

boxplot.with.outlier.label(y~x1, lab_y, data=df)

[Figure: boxplot produced by the procedure described above]

Josh O'Brien
  • Thanks Josh. I ran into same problem a few days ago so this was a great help +1 – pssguy Oct 31 '11 at 13:40
  • Glad to help. Since this looks like it's of use to some other people, I've just emailed the script's author, as he'd asked users to do if they find errors in the script. – Josh O'Brien Oct 31 '11 at 15:09
  • Update: Thanks to Josh asking this question (and detecting the point of failure in the function) - I was able to upload an updated version which solves this problem. Thank you Josh. Best, Tal – Tal Galili Nov 01 '11 at 08:32

Intelligent point-label placement is a separate issue, discussed here or here. There is no single ideal solution, so you just have to pick one of the approaches described there.

So you would overplot the normal boxplot with labels, as follows:

set.seed(1501)
y <- c(4, 0, 7, -5, rnorm(16))
x1 <- c("a", "a", "b", "b", sample(letters[1:2], 16, replace = TRUE))
lab_y <- sample(letters, 20)

bx <- boxplot(y~x1)

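# Look up a label for each outlier (first observation with the same y value)
out_lab <- character(length(bx$out))
for (i in seq_along(bx$out)) {  # seq_along() is safe when there is exactly one outlier
    out_lab[i] <- lab_y[which(y == bx$out[i])[1]]
}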

identify(bx$group, bx$out, labels = out_lab, cex = 0.7)

Then, while identify() is running, you click near each point where you want its label to appear, as described here. When finished, you press "STOP". Note that an outlier can match more than one label (when several observations share the same y value). In my solution, I simply pick the first.

PS: I'm ashamed of the for loop, but I don't know how to vectorize it. Feel free to post an improvement.
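
One possible way to vectorize the lookup is match(), which returns the position of the first element of y equal to each outlier, so it should give the same labels as the loop. This is only a sketch: it reuses the data from above, passes plot = FALSE so nothing is drawn, and, like the loop, matches on the y value alone without checking the group.

```r
set.seed(1501)
y <- c(4, 0, 7, -5, rnorm(16))
x1 <- c("a", "a", "b", "b", sample(letters[1:2], 16, replace = TRUE))
lab_y <- sample(letters, 20)

# Compute the boxplot statistics only, without drawing
bx <- boxplot(y ~ x1, plot = FALSE)

# match() gives, for each outlier, the index of the first equal value in y
out_lab <- lab_y[match(bx$out, y)]
```

The resulting out_lab can then be passed to identify(bx$group, bx$out, labels = out_lab, cex = 0.7) after drawing the plot.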

EDIT: Inspired by Federico's link, I now see it can be done much more easily, with just these two commands:

boxplot(y~x1)
identify(as.integer(as.factor(x1)), y, labels = lab_y, cex = 0.7)
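
If clicking is not an option, a non-interactive variant along the lines of the text() idea from the question might look like the sketch below. It assumes the default outlier rule (1.5 times the interquartile range, which is what boxplot.stats() uses by default), and the helper name is_out is made up for illustration. Labels are placed to the right of each point and may still overlap.

```r
set.seed(1501)
y <- c(4, 0, 7, -5, rnorm(16))
x1 <- c("a", "a", "b", "b", sample(letters[1:2], 16, replace = TRUE))
lab_y <- sample(letters, 20)

bx <- boxplot(y ~ x1)

# Flag outliers within each group, so that a label is only attached to
# observations that are outliers in their own box.
# ave() returns 0/1 here, hence the comparison with 1.
is_out <- ave(y, x1, FUN = function(v) v %in% boxplot.stats(v)$out) == 1

# Boxes are drawn at positions 1, 2, ... in factor-level order
text(as.integer(factor(x1))[is_out], y[is_out],
     labels = lab_y[is_out], pos = 4, cex = 0.7)
```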
Tomas