
I am looking for a faster alternative to R's hist(x, breaks=XXX, plot=FALSE)$count, since I don't need any of the other output it produces. I want to call it inside sapply for about 1 million iterations, e.g.

x = runif(100000000, 2.5, 2.6)
bincounts = hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count

Any thoughts?

Tom Wenseleers
  • Maybe check out the code for `hist.default` and toss out the parts you don't need? For example, are there non-finite numbers? `hist` checks for that. – Zelazny7 Jul 18 '16 at 13:16
  • Well it's seemingly doing a call to .Call(C_BinCount, x, fuzzybreaks, right, include.lowest) - what would be the best way to call that from any regular script? I only have finite values. – Tom Wenseleers Jul 18 '16 at 13:25
  • In each iteration, are you creating a new `x` or using the same `x`? If `x` is the same during your `sapply` consider `sort`ing it at the start, as it, generally, will decrease the computational time in either of `hist`/`findInterval` – alexis_laz Jul 18 '16 at 15:08
  • Ha no it's all different vectors! – Tom Wenseleers Jul 18 '16 at 15:13

2 Answers


A first attempt using table and cut:

table(cut(x, breaks=seq(0,3,length.out=100)))

It avoids the extra output, but takes about 34 seconds on my computer:

system.time(table(cut(x, breaks=seq(0,3,length.out=100))))
   user  system elapsed 
 34.148   0.532  34.696 

compared to 3.5 seconds for hist:

system.time(hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count)
   user  system elapsed 
  3.448   0.156   3.605

Using tabulate and .bincode runs a little bit faster than hist:

tabulate(.bincode(x, breaks=seq(0,3,length.out=100)), nbins=100)

system.time(tabulate(.bincode(x, breaks=seq(0,3,length.out=100)), nbins=100))
   user  system elapsed 
  3.084   0.024   3.107

Using tabulate and findInterval gives a large speed-up over table and cut and a moderate improvement over hist:

tabulate(findInterval(x, vec=seq(0,3,length.out=100)), nbins=100)

system.time(tabulate(findInterval(x, vec=seq(0,3,length.out=100)), nbins=100))
   user  system elapsed 
  2.044   0.012   2.055
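
The `tabulate`/`findInterval` idea can be wrapped in a small helper for use inside `sapply`. This is only a sketch: the name `fast_bincount` is hypothetical, it assumes finite values that all lie inside the range of `breaks`, and it uses `nbins = length(breaks) - 1` so that the result has one count per interval, which is the length `hist` returns. Note that `findInterval` bins on `[a, b)` while `hist` defaults to `(a, b]`, so the counts can differ for values that fall exactly on a break.

```r
# Hypothetical helper: counts per bin via findInterval + tabulate.
# Assumes x is finite and min(breaks) <= x < max(breaks).
fast_bincount <- function(x, breaks) {
  tabulate(findInterval(x, vec = breaks), nbins = length(breaks) - 1L)
}

breaks <- seq(0, 3, length.out = 100)

# Small check against hist() on values that do not sit on a break
set.seed(1)
x_small <- runif(1000, 2.5, 2.6)
counts_fast <- fast_bincount(x_small, breaks)
counts_hist <- hist(x_small, breaks = breaks, plot = FALSE)$counts
all(counts_fast == counts_hist)

# Usage over many vectors: one column of counts per input vector
xs <- replicate(5, runif(1000, 2.5, 2.6), simplify = FALSE)
count_matrix <- sapply(xs, fast_bincount, breaks = breaks)
dim(count_matrix)
```

Precomputing `breaks` once outside the loop also avoids rebuilding the same `seq()` in every iteration.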
lmo
  • Yes, but the problem with this one is that it's super slow... So that's not an option for me as I have to iterate it 1 million times... – Tom Wenseleers Jul 18 '16 at 13:04
  • Yes. It seems that `hist` is about 10 times faster. I'm trying a number of alternatives right now to see if I can find any speed up. – lmo Jul 18 '16 at 13:07
  • Just ran the above solution with Microsoft R Open 3.2.5, the speed is much better. `user 1.34` `system 0.09` `elapsed 1.43 `. Just may be one option. – user5249203 Jul 18 '16 at 13:15
  • Perfect - thanks so much! tabulate findInterval it will be :-) – Tom Wenseleers Jul 18 '16 at 13:27
  • Ha sorry, just noticed a problem still - it seems that in tabulate you still have to add the argument nbins=length(x) to match the output of hist() – Tom Wenseleers Jul 18 '16 at 13:33
  • I think it is `nbins=length(seq(0,3, length.out=100))`. I'll make the change so they line up. – lmo Jul 18 '16 at 14:00
  • Ha yes sorry that's it! Thx again for the help! – Tom Wenseleers Jul 18 '16 at 14:05
  • I get another 10% speed-up from cutting out overhead: `.Internal(tabulate(.Internal(findInterval(breaks, x, FALSE, FALSE)), 100L))` – MichaelChirico Jul 18 '16 at 14:27
  • Thanks for the update @MichaelChirico. I'll have to look into using the `.Internal` function. It looks like a clean way to pull out some extra speed. – lmo Jul 18 '16 at 14:31
  • It's just a matter of examining the code of `tabulate` and `findInterval` by entering `tabulate` on the console – MichaelChirico Jul 18 '16 at 14:38

Seems your best bet is to just cut out all the overhead of hist.default.

nB1 <- 99
delt <- 3/nB1
fuzz <- 1e-7 * c(-delt, rep.int(delt, nB1))
breaks <- seq(0, 3, by = delt) + fuzz

.Call(graphics:::C_BinCount, x, breaks, TRUE, TRUE)

I pared it down to this by running debugonce(hist.default) to get a feel for exactly how hist works (and testing with a smaller vector -- n = 100 instead of 100000000).

Comparing:

x = runif(100, 2.5, 2.6)
y1 <- .Call(graphics:::C_BinCount, x, breaks, TRUE, TRUE)  # breaks already includes fuzz
y2 <- hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count
identical(y1, y2)
# [1] TRUE
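
The same steps can be generalized to other break vectors. The sketch below is an illustration under stated assumptions, not a reproduction of everything `hist.default` does: the helper name `bincount_fuzzy` is hypothetical, it only covers the `right = TRUE, include.lowest = TRUE` case used above, it takes the fuzz scale from the median bin width, and it relies on the unexported `graphics:::C_BinCount`, which may change between R versions.

```r
# Hypothetical helper mirroring the steps above for arbitrary sorted breaks.
# Assumes finite x inside range(breaks); right-closed bins, lowest included.
bincount_fuzzy <- function(x, breaks) {
  nB <- length(breaks)
  diddle <- 1e-7 * stats::median(diff(breaks))
  # First break is nudged down, all others are nudged up
  fuzz <- c(-diddle, rep.int(diddle, nB - 1L))
  .Call(graphics:::C_BinCount, x, breaks + fuzz, TRUE, TRUE)
}

breaks <- seq(0, 3, length.out = 100)
set.seed(42)
x_small <- runif(100, 2.5, 2.6)
y_fast <- bincount_fuzzy(x_small, breaks)
y_hist <- hist(x_small, breaks = breaks, plot = FALSE)$counts
identical(y_fast, y_hist)
```

Because the helper skips the validity checks in `hist.default`, it is only appropriate when the inputs are already known to be finite and within the breaks.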
MichaelChirico
  • @TomWenseleers regardless it's strange that this is slower... why isn't `hist` just calling `tabulate(findInterval)` if it's better?? I'm wondering if there's significant overhead from using `graphics:::`... – MichaelChirico Jul 18 '16 at 13:51