12

I'm having difficulty with a few outliers making the color scale useless.

My data has a Length variable that mostly falls within a narrow range but usually includes a few much larger values. The example data below has 95 values between 500 and 1500 and 5 values of 50,000 or more. The resulting color legend tends to put its color changes at 10k, 20k, ... 70k, when I want to see color changes between 500 and 1500. Really, anything over around 1300 should be the same solid color (probably based on median +/- mad), but I don't know where to define that.

I'm open to any ggplot solution, but ideally lower values would be red, middle values white, and higher values blue (low is bad). In my own dataset, date is an actual date wrapped in as.POSIXct() in the ggplot aes(), but that doesn't seem to affect the example.

#example data
library(ggplot2)
date <- sample(x=1:10,size=100,replace=TRUE)
stateabbr <- sample(x=1:50,size=100,replace=TRUE)
Length <- c(sample(x=500:1500,size=95,replace=TRUE),60000,55000,70000,50000,65000)
x <- data.frame(date=date,stateabbr=stateabbr,Length=Length)

#main plot
#ggtitle() is used because opts() is no longer available in current ggplot2
(g <- ggplot(data=x,aes(x=date,y=factor(stateabbr))) +
  geom_point(aes(color=as.numeric(as.character(Length))),alpha=3/4,size=4) + 
  #scale_x_datetime(labels=date_format("%m/%d")) + 
  ggtitle("Date and State") + xlab("Date") + ylab("State"))

#problem
g + scale_color_gradient2("Length",midpoint=median(x$Length))

Adding trans="log" or "sqrt" doesn't quite do the trick either.

Thank you for your help!

ARobertson
  • my workaround has been to use a log scale (or something like it) for coloring when I have outliers. However, I'd love to know if there is a better way! – Justin Mar 21 '12 at 20:05
  • Yeah, I had tried that, but it's still off for this example. Hopefully a better way comes up! – ARobertson Mar 21 '12 at 20:14
  • You can use ?cut to create another variable with your preferred breaks and then set the `color=` aesthetic to that variable. – Brandon Bertelsen Mar 21 '12 at 20:36

3 Answers

9

Here's one slightly tricky option:

#Create a new variable indicating the unusual values
x$Length1 <- "> 1500"
x$Length1[x$Length <= 1500] <- NA

#main plot
# Using fill - tricky!
g <- ggplot() +
  geom_point(data = subset(x,Length <= 1500),
             aes(x=date,y=factor(stateabbr),color=Length),size=4) + 
  geom_point(data = subset(x,Length > 1500),
             aes(x=date,y=factor(stateabbr),fill=Length1),size=4)+
  ggtitle("Date and State") + xlab("Date") + ylab("State")

#problem
g + scale_color_gradient2("Length",midpoint=median(x$Length))


So the tricky part here is using fill on points, in order to convince ggplot to make another legend. You can obviously customize this with different labels and colors for the fill scale.

One more thing, after reading Brandon's answer: you could in principle combine both approaches by taking the outlying values, using cut to create a separate categorical variable for them, and then using my trick with the fill scale. That way you could indicate multiple outlying groups of points.
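
A rough sketch of that combined approach, under some assumptions: the 60,000 split point, the `OutlierGroup` column name, the grey/black fill colors, and the fixed seed are all made up for illustration, and `shape = 21` is used so that the fill is actually drawn on the points.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- data.frame(
  date      = sample(1:10, 100, replace = TRUE),
  stateabbr = sample(1:50, 100, replace = TRUE),
  Length    = c(sample(500:1500, 95, replace = TRUE),
                60000, 55000, 70000, 50000, 65000)
)

# Split the data: the continuous color scale only ever sees the inliers
inliers  <- subset(x, Length <= 1500)
outliers <- subset(x, Length > 1500)

# Bin the outliers into groups (the 60000 split is a hypothetical choice)
outliers$OutlierGroup <- cut(outliers$Length,
                             breaks = c(1500, 60000, Inf),
                             labels = c("1,500 to 60,000", "> 60,000"))

g <- ggplot() +
  geom_point(data = inliers,
             aes(x = date, y = factor(stateabbr), color = Length),
             size = 4) +
  # shape 21 is a filled circle, so the fill aesthetic is visible
  geom_point(data = outliers,
             aes(x = date, y = factor(stateabbr), fill = OutlierGroup),
             shape = 21, size = 4) +
  scale_color_gradient2("Length", low = "red", mid = "white", high = "blue",
                        midpoint = median(inliers$Length)) +
  scale_fill_manual("Outliers", values = c("grey70", "black")) +
  ggtitle("Date and State") + xlab("Date") + ylab("State")
```

Taking the midpoint from the inliers only keeps the white band centered on the typical range; whether that is the right center is a judgment call.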

joran
6

From my comment, see ?cut

x$colors <- cut(x$Length, breaks=c(0,500,1000,1300,max(x$Length)))

g <- ggplot(data=x,aes(x=date,y=factor(stateabbr),color=colors)) +
    geom_point() + 
    ggtitle("Date and State") + 
    xlab("Date") + 
    ylab("State")
Brandon Bertelsen
  • In this case, I would have to supply "continuous-looking" colors to a discrete variable with scale_color_manual, right? I'm getting discrete coloring, which isn't bad, just an observation. – ARobertson Mar 26 '12 at 18:26
  • Yes, to fit your original question (red -> white -> blue). Try something like + scale_colour_manual(values=c("red","white","blue")). See here for more effective palettes: http://learnr.wordpress.com/2009/04/15/ggplot2-qualitative-colour-palettes/ I think the colorspace palettes would likely suit your need best. Just remember that you need a colour for each interval you create with cut. It's not hard to fake it so it "seems" continuous, with a bit of clever palette usage. – Brandon Bertelsen Mar 27 '12 at 06:14
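
One way the "continuous-looking" discrete palette from the comments might be built is to interpolate one color per `cut` level with `colorRampPalette()`. This is only a sketch: the seed and the `pal` name are assumptions, and naming the palette by level is a precaution so that colors stay attached to the right interval even if an interval happens to be empty.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- data.frame(
  date      = sample(1:10, 100, replace = TRUE),
  stateabbr = sample(1:50, 100, replace = TRUE),
  Length    = c(sample(500:1500, 95, replace = TRUE),
                60000, 55000, 70000, 50000, 65000)
)

# Same breaks as in the answer above
x$colors <- cut(x$Length, breaks = c(0, 500, 1000, 1300, max(x$Length)))

# Interpolate one color per level along red -> white -> blue
pal <- colorRampPalette(c("red", "white", "blue"))(nlevels(x$colors))
names(pal) <- levels(x$colors)  # tie each color to its interval

g <- ggplot(x, aes(x = date, y = factor(stateabbr), color = colors)) +
  geom_point(size = 4) +
  scale_colour_manual("Length", values = pal) +
  ggtitle("Date and State") + xlab("Date") + ylab("State")
```

Using more, narrower breaks between 500 and 1300 would make the steps look smoother.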
3

Get rid of the outliers. Quick and dirty, I know, but I think it was worth saying. You can always describe them in your text. Why let them ruin your analyses and graphs?

There's a paper referenced in this blog post which deals with ethically removing outliers:

http://psuc2f.wordpress.com/2011/10/14/is-it-dishonest-or-unethical-to-remove-outliers/
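
If you do drop them, one possible rule is the median +/- mad idea from the question. The sketch below is an assumption-laden illustration, not a recommendation: the multiplier `k = 3`, the seed, and the `kept` name are all made up, and the right cutoff depends on your data.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- data.frame(
  date      = sample(1:10, 100, replace = TRUE),
  stateabbr = sample(1:50, 100, replace = TRUE),
  Length    = c(sample(500:1500, 95, replace = TRUE),
                60000, 55000, 70000, 50000, 65000)
)

k     <- 3                        # hypothetical multiplier
upper <- median(x$Length) + k * mad(x$Length)
lower <- median(x$Length) - k * mad(x$Length)

# Keep only rows whose Length falls inside the median +/- k * mad band
kept <- subset(x, Length >= lower & Length <= upper)
```

Because the median and mad are robust statistics, the five extreme values barely move the band, so they end up outside it.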

Another simple way of dealing with them would be to cap them:

df$Value[df$Value>1300]=1300

Again, you can describe that you did this in the text, or even just edit the scale to say 1300+ instead of 1300.
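
Applied to the example data from the question, the capping idea might look like the sketch below. The `LengthCapped` column, the break positions, the seed, and the 1300 cap are illustrative assumptions.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- data.frame(
  date      = sample(1:10, 100, replace = TRUE),
  stateabbr = sample(1:50, 100, replace = TRUE),
  Length    = c(sample(500:1500, 95, replace = TRUE),
                60000, 55000, 70000, 50000, 65000)
)

cap <- 1300
# Cap into a new column so the original values are preserved
x$LengthCapped <- pmin(x$Length, cap)

g <- ggplot(x, aes(x = date, y = factor(stateabbr))) +
  geom_point(aes(color = LengthCapped), alpha = 3/4, size = 4) +
  scale_color_gradient2("Length", low = "red", mid = "white", high = "blue",
                        midpoint = median(x$LengthCapped),
                        limits = c(500, cap),
                        breaks = c(500, 900, cap),
                        # relabel the top break so the legend shows the cap
                        labels = c("500", "900", "1300+")) +
  ggtitle("Date and State") + xlab("Date") + ylab("State")
```

An alternative that leaves the data untouched is to set `limits` on the scale and pass `oob = scales::squish`, which maps out-of-range values to the nearest limit for coloring only.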

Chris Beeley