I'm noticing some strange behavior with geom_histogram from ggplot2. It seems to leave out a bar, and I can't figure out why. Here's an example:
> # show the data
> head(df)
  other_variable  variable
1              0  3.663562
2              0  3.663562
3              0  3.663562
4              0  3.663562
5              0 -3.663562
6              1 -3.663562
>
> # select 25 random rows
> set.seed(1)
> var1 <- df[runif(25,0,nrow(df)),]$variable
>
> # display the data
> var1
[1] -3.6635616 3.6635616 3.6635616 3.6635616 -3.6635616 -0.8001193
[7] 3.6635616 3.6635616 3.6635616 3.6635616 -3.6635616 3.6635616
[13] 3.6635616 3.6635616 3.6635616 3.6635616 3.6635616 3.6635616
[19] 3.6635616 3.6635616 3.6635616 3.6635616 3.6635616 -1.2950457
[25] -3.6635616
>
> # histogram of var1 doesn't include values = 3.6635616
> ggplot(data=NULL, aes(x=var1)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram produced by the last line seems to omit the bar for the largest value of var1 (3.6635616):
What's strange is that this is difficult to reproduce. If I create the variable "manually" by typing in the printed values, all the bars show up, so I suspect it has something to do with significant digits:
> # make a new vector with the same data
> var2 <- c(
+ -3.6635616, 3.6635616, 3.6635616, 3.6635616, -3.6635616, -0.8001193,
+ 3.6635616, 3.6635616, 3.6635616, 3.6635616, -3.6635616, 3.6635616,
+ 3.6635616, 3.6635616, 3.6635616, 3.6635616, 3.6635616, 3.6635616,
+ 3.6635616, 3.6635616, 3.6635616, 3.6635616, 3.6635616, -1.2950457,
+ -3.6635616
+ )
>
> # confirm that they're equal
> all.equal(var1, var2)
[1] TRUE
>
> # something suspicious
> var1[1]==var2[1]
[1] FALSE
>
> # histogram of var2 does include values = 3.6635616
> ggplot(data=NULL, aes(x=var2)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
And here's the corresponding (correct) histogram:
It also appears to be related to the number of bins. If I change it, I can get the missing bar to show up:
> # if I mess with the bin number I can get it to show up
> ggplot(data=NULL, aes(x=var1)) + geom_histogram(bins=40) # no: bar at 3.6635616 is missing
> ggplot(data=NULL, aes(x=var1)) + geom_histogram(bins=41) # yes: bar at 3.6635616 is shown
What's going on?
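The disagreement between all.equal and == above can at least be explained on its own: all.equal compares numeric values within a tolerance (about 1.5e-8 by default), while == compares the stored doubles exactly. A minimal sketch, assuming x is the 15-digit value that dput reports for var1 and y is the 7-decimal value typed into var2 (both names are introduced here for illustration only):

```r
# x: a var1 value at the precision dput() reports
# y: the same value as it was typed into var2
x <- 3.66356164612965
y <- 3.6635616

isTRUE(all.equal(x, y))  # TRUE: relative difference (~1.3e-8) is under the default tolerance (~1.5e-8)
x == y                   # FALSE: the two doubles are not identical
abs(x - y)               # roughly 4.6e-8
```

So var1 and var2 only agree to about 8 significant digits, which could matter if a value sits right on a bin boundary.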
Edit
Adding more info to try to make this reproducible.
> dput(var1)
c(-3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965,
-3.66356164612965, -0.800119300112113, 3.66356164612965, 3.66356164612965,
3.66356164612965, 3.66356164612965, -3.66356164612965, 3.66356164612965,
3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965,
3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965,
3.66356164612965, 3.66356164612965, 3.66356164612965, -1.29504568965475,
-3.66356164612965)
> sprintf("%a",var1)
[1] "-0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1"
[4] "0x1.d4ef968880dd4p+1" "-0x1.d4ef968880dd4p+1" "-0x1.99a93ca5c286dp-1"
[7] "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1"
[10] "0x1.d4ef968880dd4p+1" "-0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1"
[13] "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1"
[16] "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1"
[19] "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1"
[22] "0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1" "-0x1.4b881d43e494fp+0"
[25] "-0x1.d4ef968880dd4p+1"
Interestingly, even the dput output doesn't reproduce the issue:
> var3 = c(-3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965,
+ -3.66356164612965, -0.800119300112113, 3.66356164612965, 3.66356164612965,
+ 3.66356164612965, 3.66356164612965, -3.66356164612965, 3.66356164612965,
+ 3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965,
+ 3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965,
+ 3.66356164612965, 3.66356164612965, 3.66356164612965, -1.29504568965475,
+ -3.66356164612965)
> ggplot(data=NULL, aes(x=var3)) + geom_histogram()
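One possible reason the dput output behaves differently is that dput writes 15 significant digits by default, which is not always enough to recover a double bit for bit. A hedged sketch of rebuilding the vector from the hex strings printed by sprintf("%a") above, which R accepts as numeric literals. The names hi, m1, m2 and var4 are introduced here for illustration, and whether var4 actually reproduces the missing bar has not been verified:

```r
# Distinct values of var1 as hex floating-point literals, copied from the
# sprintf("%a") output, so that no decimal rounding is involved.
hi <- 0x1.d4ef968880dd4p+1    # about  3.6635616
m1 <- -0x1.99a93ca5c286dp-1   # about -0.8001193
m2 <- -0x1.4b881d43e494fp+0   # about -1.2950457

# Rebuild the 25 values in the same order as var1.
var4 <- rep(hi, 25)
var4[c(1, 5, 11, 25)] <- -hi
var4[6] <- m1
var4[24] <- m2

# Round-trip check: print the hex form of a few elements for comparison
# with the strings shown above.
sprintf("%a", var4[c(1, 2, 6, 24)])
```

With var1 still in the session, identical(var1, var4) would check for an exact match, and var1 == var3 would show which elements the 15-digit literals changed, if any.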