20

The great findInterval() function in R uses left-closed sub-intervals in its vec argument, as shown in its docs:

if i <- findInterval(x,v), we have v[i[j]] <= x[j] < v[i[j] + 1]

If I want right-closed sub-intervals, what are my options? The best I've come up with is this:

findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x==vec[fi])
}

Another one also works:

findInterval.rightClosed2 <- function(x, vec, ...) {
  length(vec) - findInterval(-x, -rev(vec), ...)
}

Here's a little test:

x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)
findInterval(x, vec)
# [1] 1 3 3 3 3 4 4
findInterval.rightClosed(x, vec)
# [1] 1 2 3 3 3 4 4
findInterval.rightClosed2(x, vec)
# [1] 1 2 3 3 3 4 4

But I'd like to see any other solutions if there's a better one. By "better", I mean "somehow more satisfying" or "doesn't feel like a kludge" or maybe even "more efficient". =)

(Note that there's a rightmost.closed argument to findInterval(), but it's different - it only refers to the final sub-interval and has a different meaning.)

Ken Williams
  • 22,756
  • 10
  • 85
  • 147
  • What do you think about: `findInterval(x, c(-Inf, head(vec, -1)))`? – sgibb Nov 20 '12 at 22:06
  • @sgibb that doesn't seem to do the trick, I added an example and yours doesn't give the same result. – Ken Williams Nov 20 '12 at 22:18
  • I'm a bit befuddled here, but does `findInterval(x-1,vec)` do what you are looking for? – thelatemail Nov 20 '12 at 22:21
  • @thelatemail, as long as it's just integers? – BenBarnes Nov 20 '12 at 22:27
  • I think it is ok to post answers to your own question. My vote would go to your `findInterval.rightClosed2`. – flodel Nov 20 '12 at 23:40
  • Another desirable property would be that it works with types other than numeric (e.g. `POSIXt` objects, but ideally anything with comparison operators), as `findInterval` does (`findInterval` works with anything that supports an `as.double` method). The first function above succeeds, but the second fails, because there's no way to negate `POSIXt` objects. – Ken Williams Nov 26 '12 at 22:51

3 Answers3

10

EDIT: Major clean-up in all aisles.

You might look at cut. By default, cut makes left open and right closed intervals, and that can be changed using the appropriate argument (right). To use your example:

x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)
cutVec <- c(vec, max(x)) # for cut, range of vec should cover all of x

Now create four functions that should do the same thing: Two from the OP, one from Josh O'Brien, and then cut. Two arguments to cut have been changed from default settings: include.lowest = TRUE will create an interval closed on both sides for the smallest (leftmost) interval. labels = FALSE will cause cut to return simply the integer values for the bins instead of creating a factor, which it otherwise does.

findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x==vec[fi])
}
findInterval.rightClosed2 <- function(x, vec, ...) {
  length(vec) - findInterval(-x, -rev(vec), ...)
}
cutFun <- function(x, vec){
    cut(x, vec, include.lowest = TRUE, labels = FALSE)
}
# The body of fiFun is a contribution by Josh O'Brien that got fed to the ether.
fiFun <- function(x, vec){
    xxFI <- findInterval(x, vec * (1 + .Machine$double.eps))
}

Do all functions return the same result? Yup. (notice the use of cutVec for cutFun)

mapply(identical, list(findInterval.rightClosed(x, vec)),
  list(findInterval.rightClosed2(x, vec), cutFun(x, cutVec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE

Now a more demanding vector to bin:

x <- rpois(2e6, 10)
vec <- c(-Inf, quantile(x, seq(.2, 1, .2)))

Test whether identical (note use of unname)

mapply(identical, list(unname(findInterval.rightClosed(x, vec))),
  list(findInterval.rightClosed2(x, vec), cutFun(x, vec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE

And benchmark:

library(microbenchmark)
microbenchmark(findInterval.rightClosed(x, vec), findInterval.rightClosed2(x, vec),
  cutFun(x, vec), fiFun(x, vec), times = 50)
# Unit: milliseconds
#                                expr       min        lq    median        uq       max
# 1                    cutFun(x, vec)  35.46261  35.63435  35.81233  36.68036  53.52078
# 2                     fiFun(x, vec)  51.30158  51.69391  52.24277  53.69253  67.09433
# 3  findInterval.rightClosed(x, vec) 124.57110 133.99315 142.06567 155.68592 176.43291
# 4 findInterval.rightClosed2(x, vec)  79.81685  82.01025  86.20182  95.65368 108.51624

From this run, cut seems to be the fastest.

BenBarnes
  • 19,114
  • 6
  • 56
  • 74
  • Thanks. I seem to recall that `cut` is less efficient, and it also doesn't allow trailing off the right edge of `vec` like `findInterval` does (see rows 15-17 in your output), but that part could be worked around by adding an `Inf` at the end. – Ken Williams Nov 20 '12 at 22:21
  • @KenWilliams, yeah, covering the whole range of `x` with `breaks` allows you to cut up all of `x` (see my edit, too). As to the efficiency, well, the code already exists at least. – BenBarnes Nov 20 '12 at 22:26
  • @KenWilliams, if you count speed in efficiency benchmarking, there might not be all that much of a difference between `findInterval` and `cut` after all. Perhaps the slowdown in `cut` usually comes from conversion to a factor and label-making? – BenBarnes Nov 20 '12 at 22:43
  • For completeness, could you include the OP's solutions as well in your timings...? – joran Nov 20 '12 at 23:15
  • @KenWilliams, if you wish, please see the major update above with (speed) benchmarking. – BenBarnes Nov 21 '12 at 06:49
  • Surprising results! I get similar numbers on my machine. It looks like I should switch my code to `cutFun`. And maybe the docs for `cut` need to be updated to remove the disclaimer about speed. – Ken Williams Nov 21 '12 at 14:20
  • @KenWilliams, I was surprised, too. And I just found `news(grepl("^PERF", Category) & grepl("cut",Text))`, so the docs very well might be out of date. – BenBarnes Nov 21 '12 at 14:29
  • One subtle difference to note: `findInterval` only requires that the cutpoints are non-decreasing, but `cut` requires that they're strictly increasing. Working around that would probably make `cutFun` lose the speed battle. – Ken Williams Nov 26 '12 at 22:53
  • @eddi, could you provide an example of `fiFun` failing? The couple of examples I ran seemed to return expected results. – BenBarnes Jul 01 '13 at 12:23
  • I still prefer OP's solutions because of e.g. this thread: http://stackoverflow.com/questions/16184947/cut-error-breaks-are-not-unique – user3032689 Dec 28 '15 at 18:52
4

Maybe you can use the option left.open:

findInterval(x, vec, left.open=T)
[1] 1 2 3 3 3 4 4
dabsingh
  • 281
  • 1
  • 3
  • 10
-1

If your limits are intervals you simply can grow the right interval a bit: interval+c(0,0.1) would do: findinterval(value, interval+c(0,0.1))

  • 1
    This won't work for a variety of reasons. Most fundamentally, recycling will cause this to change the right interval on every odd entry, not all of them. Further, this assumes that you know that the discretization of the data is wider than 0.1. – Matthew Lundberg Aug 17 '16 at 05:21