
I'm noticing some strange behavior with geom_histogram from ggplot2. It seems to leave out a bar and I can't figure out why. Here's an example:

> # show the data
> head(df)
  other_variable  variable
1              0  3.663562
2              0  3.663562
3              0  3.663562
4              0  3.663562
5              0 -3.663562
6              1 -3.663562
> 
> # select 25 random rows
> set.seed(1)
> var1 <- df[runif(25,0,nrow(df)),]$variable
> 
> # display the data
> var1
 [1] -3.6635616  3.6635616  3.6635616  3.6635616 -3.6635616 -0.8001193
 [7]  3.6635616  3.6635616  3.6635616  3.6635616 -3.6635616  3.6635616
[13]  3.6635616  3.6635616  3.6635616  3.6635616  3.6635616  3.6635616
[19]  3.6635616  3.6635616  3.6635616  3.6635616  3.6635616 -1.2950457
[25] -3.6635616
> 
> # histogram of var1 doesn't include values = 3.6635616
> ggplot(data=NULL, aes(x=var1)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram produced by the last line appears to drop the values of var1 that fall in the highest bin: [image: histogram of var1]
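One way to check whether observations are really being dropped, rather than just drawn outside the panel, is to look at the per-bin counts that ggplot2 computes. The sketch below assumes ggplot2 is installed and uses `layer_data()`. The vector `x` is a hypothetical stand-in typed from the printed values above, so it may not trigger the missing bar itself.

```r
# Stand-in data: the same 25 values as var1, typed from the printed output
# (an approximation; the real var1 may differ in its last bits).
x <- c(rep(3.6635616, 19), rep(-3.6635616, 4), -0.8001193, -1.2950457)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(x = x), aes(x = x)) + geom_histogram(bins = 30)

  # layer_data() returns the data frame computed for the first layer,
  # including the bin edges (xmin, xmax) and the count in each bin.
  bins <- layer_data(p, 1)
  print(bins[bins$count > 0, c("xmin", "xmax", "count")])

  # If this is smaller than length(x), some observations fell outside every bin.
  print(sum(bins$count))
}
```

Running the same check on the real var1 would show whether the 19 values at the maximum are assigned to any bin at all.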

What's strange is that it's hard to reproduce. If I build the vector "manually" from the printed values, all the bars show up, so I suspect it has to do with significant digits:

> # make a new vector with the same data
> var2 <- c(
+ -3.6635616, 3.6635616, 3.6635616, 3.6635616, -3.6635616, -0.8001193, 
+  3.6635616, 3.6635616, 3.6635616, 3.6635616, -3.6635616, 3.6635616, 
+  3.6635616, 3.6635616, 3.6635616, 3.6635616, 3.6635616, 3.6635616, 
+  3.6635616, 3.6635616, 3.6635616, 3.6635616, 3.6635616, -1.2950457, 
+ -3.6635616
+ )
> 
> # confirm that they're equal
> all.equal(var1, var2)
[1] TRUE
> 
> # something suspicious
> var1[1]==var2[1]
[1] FALSE
> 
> # histogram of var2 does include values = 3.6635616
> ggplot(data=NULL, aes(x=var2)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

And here's the corresponding (correct) histogram: [image: histogram of var2]
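The difference between the `all.equal()` and `==` results above can be seen on a single value. `all.equal()` compares numbers within a tolerance (about 1.5e-8 by default), while `==` is an exact comparison, and a vector retyped from a 7-digit printout is only an approximation of the original. A minimal sketch, where `x_full` and `x_printed` are illustrative names and `x_full` uses the 15-digit value reported by `dput(var1)` in this question:

```r
x_full    <- 3.66356164612965  # 15-digit value as reported by dput(var1)
x_printed <- 3.6635616         # the same element as the console prints it

x_full == x_printed                   # exact comparison of the two doubles
isTRUE(all.equal(x_full, x_printed))  # comparison within the default tolerance
abs(x_full - x_printed)               # size of the error introduced by retyping
```

So var1 and var2 can pass `all.equal()` and still place their maximum at slightly different positions relative to a bin edge.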

It also appears to be related to the number of bins. If I tinker with it, I can get the missing bar to show up:

> # if I mess with the bin number I can get it to show up
> ggplot(data=NULL, aes(x=var1)) + geom_histogram(bins=40) # no 
> ggplot(data=NULL, aes(x=var1)) + geom_histogram(bins=41) # yes
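Because the result flips between `bins=40` and `bins=41`, the position of the bin edges relative to the largest value looks relevant. A possible workaround, not a confirmed fix, is to pass explicit `breaks` that extend slightly beyond the data range, so that no observation sits exactly on the outermost edge. In this sketch the 1% padding and the names `x`, `pad`, and `brk` are arbitrary choices for illustration, and `x` uses the `dput()` values rather than the original var1.

```r
# Stand-in for var1, built from the dput() values in the question.
x <- c(rep(3.66356164612965, 19), rep(-3.66356164612965, 4),
       -0.800119300112113, -1.29504568965475)

rng <- range(x)
pad <- 0.01 * diff(rng)  # assumed padding: 1% of the data range on each side
brk <- seq(rng[1] - pad, rng[2] + pad, length.out = 31)  # 31 edges give 30 bins

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # breaks overrides bins and binwidth, so the edges are fully under our control.
  print(ggplot(data.frame(x = x), aes(x = x)) + geom_histogram(breaks = brk))
}
```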

What's going on?


Edit

Adding more info to try to make this reproducible.

> dput(var1)
c(-3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965, 
-3.66356164612965, -0.800119300112113, 3.66356164612965, 3.66356164612965, 
3.66356164612965, 3.66356164612965, -3.66356164612965, 3.66356164612965, 
3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965, 
3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965, 
3.66356164612965, 3.66356164612965, 3.66356164612965, -1.29504568965475, 
-3.66356164612965)
> sprintf("%a",var1)
 [1] "-0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1" 
 [4] "0x1.d4ef968880dd4p+1"  "-0x1.d4ef968880dd4p+1" "-0x1.99a93ca5c286dp-1"
 [7] "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1" 
[10] "0x1.d4ef968880dd4p+1"  "-0x1.d4ef968880dd4p+1" "0x1.d4ef968880dd4p+1" 
[13] "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1" 
[16] "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1" 
[19] "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1" 
[22] "0x1.d4ef968880dd4p+1"  "0x1.d4ef968880dd4p+1"  "-0x1.4b881d43e494fp+0"
[25] "-0x1.d4ef968880dd4p+1"

Interestingly, even the dput doesn't reproduce the issue:

> var3 = c(-3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965, 
+ -3.66356164612965, -0.800119300112113, 3.66356164612965, 3.66356164612965, 
+ 3.66356164612965, 3.66356164612965, -3.66356164612965, 3.66356164612965, 
+ 3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965, 
+ 3.66356164612965, 3.66356164612965, 3.66356164612965, 3.66356164612965, 
+ 3.66356164612965, 3.66356164612965, 3.66356164612965, -1.29504568965475, 
+ -3.66356164612965)
> ggplot(data=NULL, aes(x=var3)) + geom_histogram()

[image: histogram of var3]
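By default `dput()` writes doubles with 15 significant digits, which is not always enough to recover the exact value, so pasting its output can give a vector that differs from var1 in its last bits. The hexadecimal strings from `sprintf("%a", var1)` are exact, and `as.numeric()` can parse them back. The sketch below rebuilds the vector that way; the names `pos`, `mid_a`, `mid_b`, and `var1_exact` are made up for illustration, and whether the rebuilt vector reproduces the missing bar is untested here.

```r
# Hex strings copied from the sprintf("%a", var1) output above.
pos   <- as.numeric("0x1.d4ef968880dd4p+1")
mid_a <- as.numeric("-0x1.99a93ca5c286dp-1")
mid_b <- as.numeric("-0x1.4b881d43e494fp+0")

# Same order as var1.
var1_exact <- c(-pos, pos, pos, pos, -pos, mid_a,
                pos, pos, pos, pos, -pos, pos,
                rep(pos, 11), mid_b, -pos)

# Is the 15-digit decimal from dput() the same double as the hex value?
print(identical(pos, 3.66356164612965))
print(sprintf("%a", 3.66356164612965))

# Asking dput() for 17 significant digits gives output that round-trips exactly.
dput(var1_exact, control = "digits17")
```

If `identical()` returns FALSE, that would explain why var3 behaves differently from var1 in the histogram.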

dmp
  • Is there any way you can make this [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? Maybe do a `dput()` of the original data. Or `sprintf("%a",x)` to get the "true" representations of the decimal numbers. – MrFlick Aug 12 '16 at 20:05
  • Thanks for your comment. I wasn't aware of those functions. I edited the question but am still failing to make something reproducible... any other suggestions? – dmp Aug 12 '16 at 20:13
  • There isn't much to do if it isn't reproducible. At least the `dput` solved your problem. – Axeman Aug 12 '16 at 20:22
  • Unfortunately, no it didn't. `var3 = dput(var1)` doesn't solve it even though pasting in the output from `dput(var1)` does... any other ideas on how I can make this reproducible? – dmp Aug 12 '16 at 20:28
  • They changed a bunch of the binning code not too long ago, if I remember correctly. Check to see if you're running the latest version. Since your problem isn't reproducible, filing a bug report isn't going to be very useful. – Axeman Aug 12 '16 at 20:59
