
Problem:

I have a data frame with 2 variables (x, y). The y variable "typically" varies within a "small range", but there are a few outliers in the data frame. Here's an example:

# uniform sample data frame
# y variable "typically" varying in a "small" range between 0 and 1
df = data.frame(
  x = 1:100,
  y = runif(100)
  )

# add 2 outliers to the data frame,
# yielding a data frame
# with 98 typical values and 2 outliers
df[3, 2] = 50
df[4, 2] = -50

So the data frame has 98 typical values and 2 outliers in the y variable, as you can see from the first 10 rows, head(df, 10):

    x           y
1   1   0.9785541
2   2   0.2321611
3   3  50.0000000
4   4 -50.0000000
5   5   0.8316717
6   6   0.1135077
7   7   0.9633120
8   8   0.1473229
9   9   0.1436269
10 10   0.9252299

When plotting the data frame as a bar plot (y~x), ggplot2 automatically (and correctly) scales the y-axis to the full range of observed y values:

library(ggplot2)
ggplot(df, aes(x, y)) + geom_bar(stat="identity") 

[Figure: unwanted plot; the 2 outliers stretch the y scale, so the 98 typical y values look almost the same]

In order to focus on the "typical" values, I'd like ggplot2 to keep the y-axis on the "small" scale and let the outliers run off the axis limits.

Here's my first attempt:

lower.cut = quantile(df$y, 0.02)  
# = 0.01096518
upper.cut = quantile(df$y, 0.98)  
# = 0.9872347 

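# pad the zoom window by 10% of the quantile range on each side
# (negating lower.cut only works while it happens to be positive)
pad = 0.1 * (upper.cut - lower.cut)
ggplot(df, aes(x, y)) + geom_bar(stat="identity") +
  coord_cartesian( ylim = c(lower.cut - pad, upper.cut + pad) )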

[Figure: wanted plot appearance, but with semi-automatic .cut settings]

Question:

The first attempt has the disadvantage that the 0.02 and 0.98 quantile settings are somewhat arbitrary.

Is there a smarter (less arbitrary, more statistically grounded) way to have ggplot2 automatically limit its axis to typical values while allowing outliers to be plotted off the axis limits?

Answers I looked into:

user2030503
  • I think the result depends on the nature of the outlier. Why is it there? Is it a mistake? Can it be annotated and then a second zoomed in graphic be shown? If it's a real value, perhaps you should be visualizing it on a natural log scale. – Statwonk Sep 22 '13 at 20:43
  • @Statwonk: the (exaggerated) example actually reflects sales data for a product, where the y value is the price and x is just the time sequence. The positive outlier occurs rarely, when the product is sold to a niche market where premium prices are accepted. Typically the product is sold to a commodity market where prices are lower and do not vary that much. The negative outlier occurs when the product is returned (booked as a negative value) due to a claim, which also rarely happens. I want to focus on the normal price development in the plot, while ignoring the special events. – user2030503 Sep 22 '13 at 20:58
  • Probably the best approach is to identify outliers in the data frame (e.g. as anything > x standard deviations away from the mean) and then remove them from the data frame before plotting; that way you get to use ggplot's automatic range-setting function without interference from the outliers. – Drew Steen Sep 22 '13 at 21:03
  • Voting to close as primarily opinion-based. Experience shows that the topic of automatic removal of outliers tends to generate divergent opinions. I would assert that there really is no statistical basis for removal of outliers, and if you believe otherwise, then the question still doesn't belong on a coding website. – IRTFM Sep 22 '13 at 22:05
  • @Dwin I think there are two questions here - one is about how, statistically, to choose outliers to remove, and belongs on stats.stackexchange. The other is about how to code that, and should be posed here once OP has decided how to choose outliers for removal. – Drew Steen Sep 23 '13 at 02:45
  • The poster already knows how to code a limit to the plotting range and is asking for statistical advice about a topic that tends to elicit strong opinions. – IRTFM Sep 23 '13 at 03:17
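
For reference, the standard-deviation rule suggested by Drew Steen in the comments above could be sketched roughly as follows. This is only an illustration, not a statistically endorsed method: the threshold k = 3 and the seed are assumptions that do not come from the thread, and because the mean and standard deviation are themselves inflated by the outliers, the rule may miss less extreme ones. The sketch shows both variants: removing the flagged rows before plotting, and keeping all rows while zooming with coord_cartesian() so that the outlier bars run off the panel.

```r
# Sketch only: k = 3 is an assumed threshold, not a value from the thread.
set.seed(1)  # assumed seed, only to make the example reproducible
df <- data.frame(x = 1:100, y = runif(100))
df[3, 2] <- 50
df[4, 2] <- -50

# Flag values more than k standard deviations away from the mean
k <- 3
is_outlier <- abs(df$y - mean(df$y)) > k * sd(df$y)
typical <- df[!is_outlier, ]

# Limits from the typical values only; include 0 so bars keep their baseline
y_limits <- range(c(0, typical$y))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Variant A: drop the outliers and let ggplot2 choose the range
  p_removed <- ggplot(typical, aes(x, y)) + geom_bar(stat = "identity")
  # Variant B: keep all rows but zoom, so outlier bars extend past the panel
  p_zoomed <- ggplot(df, aes(x, y)) + geom_bar(stat = "identity") +
    coord_cartesian(ylim = y_limits)
}
```

Variant B is closer to what the question asks for, since coord_cartesian() only zooms the view and does not drop data, whereas variant A removes the special events from the plot entirely.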

0 Answers