14

I have a lot of ranges in the form [(1, 1000), (5000, 5678), ... ]. I'm trying to figure out the fastest way to check if a number is within any of the ranges. The ranges are made up of longs and are too large to just keep a set of all the numbers.

The simplest solution is this:

ranges = [(1,5), (10,20), (40,50)]  # The real code has a few dozen ranges
nums = range(1000000)  
%timeit [n for n in nums if any([r[0] <= n <= r[1] for r in ranges])]
# 1 loops, best of 3: 5.31 s per loop

Banyan is a bit faster:

import banyan
banyan_ranges = banyan.SortedSet(updator=banyan.OverlappingIntervalsUpdator)
for r in ranges:
    banyan_ranges.add(r)
%timeit [n for n in nums if len(banyan_ranges.overlap_point(n))>0]
# 1 loops, best of 3: 452 ms per loop

Although there are only a few dozen ranges, there are millions of checks against those ranges. What's the fastest way to do these checks?

(Note: This question is similar to Python: efficiently check if integer is within *many* ranges but does not have the same Django-related restrictions and is exclusively concerned with speed)

Community
  • 1
  • 1
Thomas Johnson
  • 10,776
  • 18
  • 60
  • 98
  • are your ranges sorted to begin with? – roippi May 13 '14 at 18:57
  • No, but sorting would be a minimal cost compared to the checking time – Thomas Johnson May 13 '14 at 18:58
  • alright, next question: are any overlapping? :-) – roippi May 13 '14 at 18:59
  • Yes, they can be overlapping – Thomas Johnson May 13 '14 at 18:59
  • "A few dozen" isn't a whole lot. Possibly not even enough to justify a binary search, let alone justify a complicated tree data structure. Edit: In Python 3 you could convert them into `range` objects (take care that those are half-open but your ranges are closed), which might speed up the checking. –  May 13 '14 at 19:00
  • 1
    There's only a few dozen ranges, but millions of checks against those ranges – Thomas Johnson May 13 '14 at 19:01
  • @user939259 I get that, my point is that a data structure that makes the check against the ranges asymptotically faster (e.g. logarithmic in the number of ranges) would likely yield no benefit because each check has higher constant factors. –  May 13 '14 at 19:03
  • 1
    Are the ranges stored in a DB? These things are good at this. – Jochen Ritzel May 13 '14 at 19:05
  • @JochenRitzel No it's all in-memory – Thomas Johnson May 13 '14 at 19:06
  • @JochenRitzel DBs are good at making queries scale. They are not good at making queries on tiny tables faster than naive loops over in-memory data. Unfortunately the latter is requires. –  May 13 '14 at 19:07
  • 2
    From an algorithm standpoint, you're best converting your list of _overlapping_ ranges to a list of _non-overlapping_ ranges. Then you can `bisect` your way to the closest range and check only that one. That would take your O(N*M) algorithm to an O(N * log(M)) algorithm. In practice, it's hard to say if this really gains you much (expecially for small M) – mgilson May 13 '14 at 19:09
  • 2
    @mgilson, you don't really need to split the range. Just sort them, first by the lower bound, then by the upper bound. Next, binary search first range that's upper bound is lower than value being checked. Then, get the next range in the set and compare the lower bound (It's the lowest one by the ordering). – Maciej Gol May 13 '14 at 19:16
  • 1
    The fastest way I could think of would be to turn the few dozen checks into a loop-less function with `exec`, with the checks sorted in some order (maybe binary-tree-like), and then execute the whole thing in PyPy (which is very happy with this kind of code, as long as it's not too long). – Armin Rigo May 13 '14 at 19:24
  • `any(cond for ... in ...)` should be faster than `any([...])` because it doesn't construct a throw-away list. – Erik Kaplun May 13 '14 at 19:36
  • @kroolik -- True, but condensing the ranges decreases the value of `M`. It's a pretty minor benefit I suppose, but it also costs little more than sorting in the first place. – mgilson May 13 '14 at 19:36
  • @ErikAllik -- You're right, but the biggest reason is because it allows `any` to short circuit (you don't need to look at any of the items after the first True one), not because you're constructing a list which you then throw away. – mgilson May 13 '14 at 19:37
  • well yes; I wasn't detailed enough. :) – Erik Kaplun May 13 '14 at 19:38

3 Answers3

10

Things to try:

  1. Pre-process your ranges so that they're non-overlapping, and express them as half-open intervals.
  2. Use the bisect module to do the searching. (Don't implement your own bisection search by hand!) Note that with the preprocessing in 1, all you'll need to know is whether the result of the bisect call is even or odd.
  3. If batching the queries is an option, consider grouping your inputs into an array and using numpy.searchsorted.

Some code and timings. First the setup (here using IPython 2.1 and Python 3.4):

In [1]: ranges = [(1, 5), (10, 20), (40, 50)]

In [2]: nums = list(range(1000000))  # force a list to remove generator overhead

Timings for the original method on my machine (but with a generator expression instead of a list comprehension):

In [3]: %timeit [n for n in nums if any(r[0] <= n <= r[1] for r in ranges)]
1 loops, best of 3: 922 ms per loop

Now we rework the ranges as a list of boundary points; each boundary point at an even index is an entry point to one of the ranges, while each boundary point at an odd index is an exit point. Note the conversion to half-open intervals, and that I've put all the numbers into a single list.

In [4]: boundaries = [1, 6, 10, 21, 40, 51]

With this it's easy to use bisect.bisect to get the same results as before, but rather faster.

In [5]: from bisect import bisect

In [6]: %timeit [n for n in nums if bisect(boundaries, n) % 2]
1 loops, best of 3: 298 ms per loop

Finally, depending on the context, you may be able to make use of the searchsorted function from NumPy. This is like bisect.bisect, but operates on whole collections of values at once. For example:

In [7]: import numpy

In [8]: numpy.where(numpy.searchsorted(boundaries, nums, side="right") % 2)[0]
Out[8]: 
array([ 1,  2,  3,  4,  5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 40,
       41, 42, 43, 44, 45, 46, 47, 48, 49, 50])

At first glance, the %timeit results from this are rather disappointing.

In [9]: %timeit numpy.where(numpy.searchsorted(boundaries, nums, side="right") % 2)[0]
10 loops, best of 3: 159 ms per loop

However, it turns out that much of the performance cost is in converting the inputs to searchsorted from Python lists to NumPy arrays. Let's preconvert both lists to arrays and try again:

In [10]: boundaries = numpy.array(boundaries)

In [11]: nums = numpy.array(nums)

In [12]: %timeit numpy.where(numpy.searchsorted(boundaries, nums, side="right") % 2)[0]
10 loops, best of 3: 24.6 ms per loop

Much faster than anything else so far. However, this is cheating a bit: we can certainly preprocess boundaries to turn it into an array, but if the values you want to test aren't naturally produced in array form then the cost of conversion will need to be taken into account. On the other hand, it shows that the cost of the search itself can be reduced to a small enough value that it's no longer likely to be the dominant factor in the running time.

Here's another option along these lines. It uses NumPy again, but does a direct non-lazy linear search per value. (Please forgive the out-of-order IPython prompts: I added this one later. :-)

In [29]: numpy.where(numpy.logical_xor.reduce(numpy.greater_equal.outer(boundaries, nums), axis=0))
Out[29]: 
(array([ 2,  3,  4,  5,  6, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 41,
        42, 43, 44, 45, 46, 47, 48, 49, 50, 51]),)

In [30]: %timeit numpy.where(numpy.logical_xor.reduce(numpy.greater_equal.outer(boundaries, nums), axis=0))
10 loops, best of 3: 16.7 ms per loop

For these particular test data, this is faster than searchsorted, but the time will grow linearly in the number of ranges, while for searchsorted it should grow according to the log of the number of ranges. Note that it also uses an amount of memory proportional to len(boundaries) * len(nums). This isn't necessarily a problem: if you find yourself bumping into memory constraints you can probably chunk the arrays into smaller sizes (say 10000 elements at a time) without losing too much performance.

Moving up the scale, if none of these fits the bill I'd next try Cython and NumPy, writing a search function (with inputs declared as arrays of ints) that does a simple linear search on the boundaries array. I tried this, but failed to get results better than those based on bisect.bisect. For reference, here's the Cython code I tried; you may be able to do better:

cimport cython

cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def search(np.ndarray[long, ndim=1] boundaries, long val):
    cdef long j, k, n=len(boundaries)
    for j in range(n):
        if boundaries[j] > val:
           return j & 1
    return 0

And the timings:

In [13]: from my_cython_extension import search

In [14]: %timeit [n for n in nums if search(boundaries, n)]
1 loops, best of 3: 793 ms per loop
Mark Dickinson
  • 29,088
  • 9
  • 83
  • 120
2

An implementation of @ArminRigo's comment, which is pretty fast. The timing is from CPython, not PyPy:

exec_code = "def in_range(x):\n"
first_if = True
for r in ranges:
   if first_if:
      exec_code += "    if "
      first_if = False
   else:
      exec_code += "    elif "
   exec_code += "%d <= x <= %d: return True\n" % (r[0], r[1])
exec_code += "    return False"
exec(exec_code)

%timeit [n for n in nums if in_range(n)]
# 10 loops, best of 3: 173 ms per loop
Thomas Johnson
  • 10,776
  • 18
  • 60
  • 98
1

Try to use binary search instead of linear. It should spend "Log(n)" in time. See below:

list = []
for num in nums:
    start = 0
    end = len(ranges)-1
    if ranges[start][0] <= num <= ranges[start][1]:
        list.append(num)
    elif ranges[end][0] <= num <= ranges[end][1]:
        list.append(num):
    else:
        while end-start>1:
            mid = int(end+start/2)
            if ranges[mid][0] <= num <= ranges[mid][1]:
                list.append(num)
                break
            elif num < ranges[mid][0]:
                end = mid
            else:
                start = mid
Emisilve86
  • 354
  • 2
  • 8
  • This seems really slow - much slower than either of the implementations in the question. I killed the %timeit call after a couple minutes. I'm not sure why it's so slow - maybe it's the incremental building of the list or the for-loop. – Thomas Johnson May 13 '14 at 19:46
  • Strange! Probably any() is optimized to do just this. But if you confirm that nums is always a list generated by mean range(), you may invert the problem and for each range in ranges pick all that numbers in subrange/range that fall in the nums list. This will be really linear. – Emisilve86 May 14 '14 at 10:34