
I'd like to create a ggplot2 histogram in which the plot's limits are equal to the smallest and largest values in the data set, without excluding those values from the actual histogram.

I get the behavior I'm looking for when using base graphics. Specifically, the second histogram below shows all of the same values as the first histogram (i.e., no bins are excluded in the second histogram), even though I've included an xlim argument to the second plot:

min_wt <- min(mtcars$wt)
max_wt <- max(mtcars$wt)
xlim <- c(min_wt, max_wt)

hist(mtcars$wt, breaks = 30, main = "No limits added")

hist(mtcars$wt, breaks = 30, xlim = xlim, main = "Limits added")


ggplot2 isn't giving me this behavior though:

library(ggplot2)

# Using green colour to make dropped bins easy to see:
p <- ggplot(mtcars, aes(x = wt)) + geom_histogram(colour = "green", bins = 30)
p + ggtitle("No limits added")

p + xlim(xlim) + ggtitle("Limits added") 


See how in the second plot I lose one of the points below 2 and two of the points above 5? I would like to know how to fix this. A few miscellaneous notes:

First, specifying boundary allows me to include the minimum values (i.e., those below 2) in the histogram, but I still don't have a solution to the 2 values greater than 5 that are getting dropped:

ggplot(mtcars, aes(x = wt)) + 
  geom_histogram(bins = 30, colour = "green", boundary = min_wt) + 
  xlim(xlim) +
  ggtitle("Limits added with boundary too")


Second, whether the issue appears depends on the value chosen for bins. For example, when I increase bins to 50, no values are dropped:

ggplot(mtcars, aes(x = wt)) + 
  geom_histogram(bins = 50, colour = "green", boundary = min_wt) + 
  xlim(xlim) +
  ggtitle("Limits added with boundary too, but with bins = 50")


Finally, I believe this issue is related to the one presented on SO here: geom_histogram: wrong bins? and discussed here as well: https://github.com/tidyverse/ggplot2/issues/1651. In other words, I think this issue is related to a "rounding error." I describe this error in more depth in my second post (the one with the graphs shown in it) on this issue: https://github.com/daattali/ggExtra/issues/81.
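One way to check whether a bin edge really falls outside the limits is to look at the computed bins directly. The following is a minimal sketch, assuming `layer_data()` from ggplot2 returns the usual `xmin`, `xmax`, and `count` columns for a histogram layer; `wt_limits`, `bins`, and `outside` are names introduced here for illustration only. Which bins show up will depend on the ggplot2 version and its binning defaults.

```r
library(ggplot2)

# Limits taken from the data, as in the question (hypothetical name).
wt_limits <- range(mtcars$wt)

p <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(colour = "green", bins = 30)

# Computed bins for the plot without limits: one row per bin.
bins <- layer_data(p)[, c("xmin", "xmax", "count")]

# Non-empty bins with at least one edge outside the limits.
outside <- bins$xmin < wt_limits[1] | bins$xmax > wt_limits[2]
bins[outside & bins$count > 0, ]

# Size of any overshoot, to tell a rounding-sized gap from a real one.
c(below = wt_limits[1] - min(bins$xmin),
  above = max(bins$xmax) - wt_limits[2])
```

If the overshoot is on the order of half a bin width rather than machine precision, the bins are being dropped because they genuinely extend past the limits, not only because of rounding.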

Here is my session info:

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] ggplot2_2.2.1

loaded via a namespace (and not attached):
 [1] labeling_0.3      colorspace_1.3-2  scales_0.5.0.9000
 [4] compiler_3.4.2    lazyeval_0.2.1    plyr_1.8.4       
 [7] tools_3.4.2       pillar_1.2.1      gtable_0.2.0     
[10] tibble_1.4.2      yaml_2.1.16       Rcpp_0.12.15     
[13] grid_3.4.2        rlang_0.2.0.9000  munsell_0.4.3 
Chris
  • This is probably due to the right edge of the right-most bin exceeding the maximum `xlim` value. When you specify limits using `xlim`, which is equivalent to `scale_x_continuous(limits=xlim)`, data outside the limits (including any derived "data" such as statistics, regressions, bins, etc., calculated from the data) is excluded from the plot. – eipi10 Mar 10 '18 at 05:58
  • (continued) In this case, since part of a bin is outside the limits, the data in that bin is excluded. The most straightforward way to avoid this here is to use `coord_cartesian(xlim=xlim)` instead of `xlim(xlim)`, since `coord_cartesian` sets limits without excluding data that is outside the limits. – eipi10 Mar 10 '18 at 06:00
  • Thanks @eipi10, `coord_cartesian()` does what I want. And yes, technically part of the bin is outside the limits, but only because of a rounding error (as mentioned here: https://github.com/daattali/ggExtra/issues/81) – Chris Mar 12 '18 at 13:54
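The `coord_cartesian()` suggestion from the comments can be sketched as follows. This is a minimal example, not a definitive fix: `wt_limits` and `p_zoom` are names introduced here for illustration, and the final line is only a sanity check that every observation is still counted in some bin.

```r
library(ggplot2)

# Limits taken from the data, as in the question (hypothetical name).
wt_limits <- range(mtcars$wt)

p <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(colour = "green", bins = 30)

# coord_cartesian() zooms the view. Binning is computed on the full data
# and no bins are removed, though bars may be clipped at the panel edge.
p_zoom <- p +
  coord_cartesian(xlim = wt_limits) +
  ggtitle("Limits via coord_cartesian()")

# Sanity check: the bin counts still add up to the number of rows.
sum(layer_data(p_zoom)$count) == nrow(mtcars)
```

Because the limits are applied to the coordinate system rather than the scale, nothing is censored before the bins are drawn.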

1 Answer


An alternative to what @eipi10 mentioned in the comments is to change the `oob` (out of bounds) argument of `scale_x_continuous()`.

The documentation describes `oob` as: "Function that handles limits outside of the scale limits (out of bounds). The default replaces out of bounds values with NA."

The default is `scales::censor()`. You can change it to `oob = scales::squish`, which moves out-of-bounds values to the nearest limit instead of dropping them.
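To see the difference between the two functions outside of a plot, here is a small sketch on a plain numeric vector. The vector `x` and the `limits` are made-up values chosen for illustration, not taken from `mtcars`.

```r
library(scales)

# Hypothetical limits and values: two of the values fall outside the limits.
limits <- c(2, 5)
x <- c(1.5, 2, 3.2, 5, 5.4)

# censor(): out-of-range values become NA (and are later dropped by ggplot2).
censored <- censor(x, range = limits)

# squish(): out-of-range values are moved to the nearest limit.
squished <- squish(x, range = limits)

censored
squished
```

In the histogram, the values passed to `oob` are the computed bin edges, so squishing keeps a bin whose edge slightly overshoots the limits rather than discarding it.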

Compare the following two plots.

p + scale_x_continuous(limits = xlim) + ggtitle("default: scales::censor")

warning: Removed 1 rows containing missing values (geom_bar).


p + scale_x_continuous(limits = xlim, oob = scales::squish) + ggtitle("using scales::squish")


Your third plot, where you specified a boundary but the 2 values greater than 5 were still dropped, would then look like this:

ggplot(mtcars, aes(x = wt)) + 
 geom_histogram(bins = 30, colour = "green", boundary = min_wt) + 
 scale_x_continuous(limits = xlim, oob = scales::squish) +
 ggtitle("Limits added with boundary too") +
 labs(subtitle = "scales::squish")


Hope this helps.

markus
  • Thanks for the great answer...Using `boundary = min_wt` and `oob = scales::squish` seems to be doing what I need – Chris Mar 12 '18 at 16:24