45

I am working on a postage application which is required to check an integer postcode against a number of postcode ranges, and return a different code based on which range the postcode matches against.

Each code has more than one postcode range. For example, the M code should be returned if the postcode is within the ranges 1000-2429, 2545-2575, 2640-2686 or is equal to 2890.

I could write this as:

if 1000 <= postcode <= 2429 or 2545 <= postcode <= 2575 or 2640 <= postcode <= 2686 or postcode == 2890:
    return 'M'

but this seems like a lot of lines of code, given that there are 27 returnable codes and 77 total ranges to check against. Is there a more efficient (and preferably more concise) method of matching an integer to all these ranges using Python?


Edit: There's a lot of excellent solutions flying around, so I have implemented all the ones that I could, and benchmarked their performances.

The environment for this program is a web service (Django-powered, actually) which needs to check postcode region codes one by one, on the fly. My preferred implementation, then, would be one that can be used quickly for each request, and that doesn't need any process to be kept in memory or to process many postcodes in bulk.

I tested the following solutions using timeit.Timer with default 1000000 repetitions using randomly generated postcodes each time.
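
For reference, such a harness looks roughly like this (a sketch rather than my exact code; get_code and the postcode_codes module are stand-ins for whichever solution is being timed):

import timeit

setup = """
import random
from postcode_codes import get_code  # hypothetical module holding the solution under test
"""

# time 1,000,000 lookups of randomly generated four-digit postcodes
t = timeit.Timer("get_code(random.randint(1000, 9999))", setup)
print t.timeit()  # default number=1000000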

IF solution (my original)

if 1000 <= postcode <= 2249 or 2555 <= postcode <= 2574 or ...:
    return 'M'
if 2250 <= postcode <= 2265 or ...:
    return 'N'
...

Time for 1m reps: 5.11 seconds.

Ranges in tuples (Jeff Mercado)

Somewhat more elegant to my mind, and certainly easier to enter and read the ranges. Particularly good if they change over time, which is possible. But it did end up four times slower in my implementation.

if any(lower <= postcode <= upper for (lower, upper) in [(1000, 2249), (2555, 2574), ...]):
    return 'M'
if any(lower <= postcode <= upper for (lower, upper) in [(2250, 2265), ...]):
    return 'N'
...

Time for 1m reps: 19.61 seconds.

Set membership (gnibbler)

As stated by the author, "it's only better if you are building the set once to check against many postcodes in a loop". But I thought I would test it anyway to see.

if postcode in set(chain(*(xrange(start, end+1) for start, end in ((1000, 2249), (2555, 2574), ...)))):
    return 'M'
if postcode in set(chain(*(xrange(start, end+1) for start, end in ((2250, 2265), ...)))):
    return 'N'
...

Time for 1m reps: 339.35 seconds.

Bisect (robert king)

This one may have been a bit above my intellect level. I learnt a lot reading about the bisect module, but just couldn't quite work out which parameters to give find_ge() to make a runnable test. I expect that it would be extremely fast over a loop of many postcodes, but not if it had to do the setup each time. So I timed 1m repetitions of filling numbers, edgepairs, edgeanswers etc. for just one postal region code (the M code with four ranges), without actually running fast_solver:

Time for 1m reps: 105.61 seconds.

Dict (sentinel)

Using one dict per postal region code pre-generated, cPickled in a source file (106 KB), and loaded for each run. I was expecting much better performance from this method, but on my system at least, the IO really destroyed it. The server is a not-quite-blindingly-fast-top-of-the-line Mac Mini.
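
A minimal sketch of the shape of this solution (the pickle file name and the generation step are illustrative, not the exact code from that answer):

import cPickle
from itertools import chain

# one-off generation step: expand every range into explicit dict keys
ranges = {'M': ((1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890))}  # ...26 more codes
postcodemap = {}
for code, pairs in ranges.items():
    for postcode in chain(*(xrange(start, end + 1) for start, end in pairs)):
        postcodemap[postcode] = code
with open('postcodes.pickle', 'wb') as f:
    cPickle.dump(postcodemap, f, cPickle.HIGHEST_PROTOCOL)

# per request: load the pickle and look up (the load is what dominated the timing)
with open('postcodes.pickle', 'rb') as f:
    postcodemap = cPickle.load(f)
print postcodemap.get(2000)  # 'M'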

Time for 1m reps: 5895.18 seconds (extrapolated from a 10,000 run).

The summary

Well, I was expecting someone to just give a simple 'duh' answer that I hadn't considered, but it turns out this is much more complicated (and even a little controversial).

If every nanosecond of efficiency counted in this case, I would probably keep a separate process running which implemented one of the binary search or dict solutions and kept the result in memory for an extremely fast lookup. However, since the IF tree takes only five seconds to run a million times, which is plenty fast enough for my small business, that's what I'll end up using in my application.

Thank you to everyone for contributing!

tobygriffin
  • Ah, so some ranges map to different letters? – John La Rooy May 19 '11 at 05:44
  • The set answer is good, but all things told, you should test the performance. It might very well be that the `if` branches are faster than set inclusion (and they definitely take less memory). – Nick Bastin May 19 '11 at 06:40
  • If your intervals do not overlap a binary search should be plenty efficient. If your intervals overlap and you have lots of them, read up on [Interval trees](http://en.wikipedia.org/wiki/Interval_tree). – samplebias May 19 '11 at 22:03
  • What do you do with your "if tree" when the lookup table changes? – John Machin May 20 '11 at 02:02
  • I would manually change the code. Because the lookup table is just a paper postage contract, whichever solution is used would involve some manual typing of numbers whenever that changed regardless. – tobygriffin May 20 '11 at 02:19
  • @tobygriffin: a developer is needed to change code, but only a user is needed to edit a datafile. – John Machin May 20 '11 at 02:40
  • Regarding my solution with bisect: you only need to do the pre-processing once! Once you have edgepairsanswers and edgeanswers sitting in memory, you only need to run fast_solver each time (the whole point is that you are using pre-calculated results rather than recalculating them each time). – Rusty Rob May 20 '11 at 12:15

11 Answers

19

You can throw your ranges into tuples and put the tuples in a list. Then use any() to help you find if your value is within these ranges.

ranges = [(1000,2429), (2545,2575), (2640,2686), (2890, 2890)]
if any(lower <= postcode <= upper for (lower, upper) in ranges):
    print('M')
Jeff Mercado
  • Why an upvote for a solution with the same horrible efficiency as the code by the OP? Sorry, not an appropriate solution for the problem! – May 19 '11 at 05:39
  • @Sentinel: Not an appropriate solution? How so? Is there some magic hashing function that can turn anything like this into an O(1) solution in a reasonable amount of space? I don't buy it. This is probably the best it's gonna get with a reasonable compromise. _If_ there's such a solution that can be more efficient in terms of speed and efficiency, then by all means, be my guest and share it with us. – Jeff Mercado May 19 '11 at 05:51
  • @jeff: your solution is nothing but code-rewriting in the same horrible but slightly more elegant way - nothing more, nothing less. Look at my answer. A hash-based solution is both efficient and more maintainable than having hard-coded ranges here. Memory doesn't matter, memory is cheap... nobody needs yet another runtime-inefficient solution like yours or the OP's. There is a slight difference between O(1) and O(n)... – May 19 '11 at 05:56
  • @Sentinel: Answer me this then, suppose we had the ranges `[(-5000000, 5000000), (6000000, 100000000000)]`, would your solution be a viable one? Not likely. The point is that there is a compromise made here for both of us and I went for memory efficiency. How can you argue with that? – Jeff Mercado May 19 '11 at 06:01
  • We don't have such large ranges *here*. Postcodes are usually max 5 chars long! – May 19 '11 at 06:08
  • Rewriting code by reformatting it does not deserve any upvotes. –  May 19 '11 at 06:08
  • @Sentinel - there's such a thing as premature optimisation. This is clear and convenient code, and for a reasonably small number of ranges, quite fast. One thing about O(n) linear searches is that, unlike some asymptotically superior solutions, they use cache very efficiently. A hashtable solution could be slower - for repeated searches, the lookups can have poor locality, and for a small number of searches the cost of building the hashtable outweighs any gains. My binary search solution could also easily be slower. –  May 19 '11 at 06:30
  • This is an unbelievably horrible solution. Sets are not any less clear, and *way* faster. – Nick Bastin May 19 '11 at 06:38
  • @Steve314: you should do some research - there isn't a range in the world in python that uses cache efficiently. Integer objects are *huge* compared to other languages where you might care about being efficient with cache. – Nick Bastin May 19 '11 at 06:39
  • @Nick Bastin - depends what order you allocate them in. Allocating them together at the same time (providing the heap isn't too fragmented) they'll tend to be clustered together. Also, small integers (I don't know how small) are given special treatment - interned at the very least. In any case, the references will have good locality in a linear search (where they may not for alternatives), even if the integers themselves don't. As for being huge - they are optimised for (I think) 32 bit integers, so not so much unless you need long integers. And cache pages aren't *that* small. –  May 19 '11 at 07:00
  • @Steve314: The interned integers are 0-100. Depending on your version of python they are no longer optimized for short integers (32-bits or less). If you wanted something truly optimal for postcodes you'd implement a 17-bit set (although again, I suggest that someone should test the branches versus set search, as the branches may be faster, despite their considerable expense). – Nick Bastin May 19 '11 at 07:55
9

Here is a fast and short solution, using numpy:

import numpy as np
lows = np.array([1, 10, 100]) # the lower bounds
ups = np.array([3, 15, 130]) # the upper bounds

def in_range(x):
    return np.any((lows <= x) & (x <= ups))

Now for instance

in_range(2) # True
in_range(23) # False
Olivier Verdier
8

Probably the fastest will be to check the membership of a set

>>> from itertools import chain
>>> ranges = ((1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890))
>>> postcodes = set(chain(*(xrange(start, end+1) for start, end in ranges)))
>>> 1000 in postcodes
True
>>> 2500 in postcodes
False

But it does use more memory this way, and building the set takes time, so it's only better if you are building the set once to check against many postcodes in a loop

EDIT: seems that different ranges need to map to different letters

>>> from itertools import chain
>>> ranges = {'M':((1000,2429), (2545,2575), (2640,2686), (2890, 2890)),
              # more ranges
              }
>>> postcodemap = dict((postcode, code) for code in ranges
                       for postcode in chain(*(xrange(start, end + 1)
                                               for start, end in ranges[code])))
>>> print postcodemap.get(1000)
M
>>> print postcodemap.get(2500)
None
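
In a long-running process the build cost is paid once at import time, after which each request is a single dict lookup, e.g. (a trivial wrapper, for illustration):

def region_code(postcode):
    # postcodemap built once at module import, as above
    return postcodemap.get(postcode)
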
John La Rooy
  • You might want to use the class method [`chain.from_iterable()`](http://docs.python.org/library/itertools.html#itertools.chain.from_iterable) instead, that way you're not going through the values prematurely in the argument expansion. – Jeff Mercado May 19 '11 at 06:13
5

You only have to solve for the edge cases, and for one number between each pair of edge cases, when doing inequalities.

e.g. if you do the following tests on TEN:

10 < 20, 10 < 15, 10 > 8, 10 > 12

it will give True, True, True, False,

but note that the closest boundary numbers to 10 are 8 and 12.

This means that 9, 10 and 11 will give the same answers that ten did. If you don't have too many initial range numbers and they are sparse, this will help. Otherwise you will need to see whether your inequalities are transitive and use a range tree or something.

So what you can do is sort all of your boundaries into intervals. E.g. if your inequalities had the numbers 12, 50, 192, 999

you would get the following intervals that ALL have the same answer: less than 12, 12, 13-49, 50, 51-191, 192, 193-998, 999, 999+

as you can see from these intervals we only need to solve for 9 cases and we can then quickly solve for anything.

Here is an example of how I might carry it out for solving for a new number x using these pre-calculated results:

a) is x a boundary? (is it in the set) if yes, then return the answer you found for that boundary previously. otherwise use case b)

b) Find the maximum boundary number that is smaller than x; call it maxS. Find the minimum boundary number that is larger than x; call it minL. Now just return the previously found solution for the interval between maxS and minL.

See Python binary search-like function to find first number in sorted list greater than a specific value for finding the closest numbers. The bisect module will help (import it in your code); it will help in finding maxS and minL.

You can use bisect and the functions I have included in my sample code:

from bisect import bisect_left

def find_ge(a, key):
    '''Find smallest item greater-than or equal to key.
    Raise ValueError if no such item exists.
    If multiple keys are equal, return the leftmost.

    '''
    i = bisect_left(a, key)
    if i == len(a):
        raise ValueError('No item found with key at or above: %r' % (key,))
    return a[i]

def find_le(a, key):
    '''Find largest item less-than or equal to key (the mirror of find_ge).'''
    i = bisect_left(a, key)
    if i < len(a) and a[i] == key:
        return a[i]
    if i == 0:
        raise ValueError('No item found with key at or below: %r' % (key,))
    return a[i - 1]


ranges = [(1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)]
numbers = []
for pair in ranges:
    numbers += list(pair)

numbers += [-999999, 999999]  # ensure nothing goes outside the range
numbers.sort()
edges = set(numbers)

# map each pair of adjacent edges to a representative number between them
edgepairs = {}
for i in range(len(numbers) - 1):
    edgepairs[(numbers[i], numbers[i + 1])] = (numbers[i] + numbers[i + 1]) // 2


def slow_solver(x):
    # your answer for postcode x (the expensive check, run only during setup)
    return 'M' if any(low <= x <= high for low, high in ranges) else None


# pre-calculate the answer at every edge and between every pair of edges
listedges = list(edges)
edgeanswers = dict(zip(listedges, map(slow_solver, listedges)))
edgepairsanswers = dict(zip(edgepairs.keys(), map(slow_solver, edgepairs.values())))

# now we are ready for fast solving:
def fast_solver(x):
    if x in edges:
        return edgeanswers[x]
    else:
        # find the edges bracketing x: maxS below, minL above
        maxS = find_le(numbers, x)
        minL = find_ge(numbers, x)
        return edgepairsanswers[(maxS, minL)]
Rusty Rob
4

The full data isn't there, but I'm assuming the ranges are non-overlapping, so you can express your ranges as a single sorted tuple of ranges, along with their codes:

ranges = (
    (1000, 2249, 'M'), 
    (2250, 2265, 'N'), 
    (2555, 2574, 'M'),
    # ...
)

This means we can binary search over them in one go. This should be O(log(N)) time, which should result in pretty decent performance with very large sets.

def code_lookup(value, ranges):
    left, right = 0, len(ranges)

    while left != right - 1:
        mid = left + (right - left) // 2

        if value <= ranges[mid - 1][1]:  # Check left split max
            right = mid
        elif value >= ranges[mid][0]:    # Check right split min
            left = mid
        else:                            # We are in a gap
            return None

    if ranges[left][0] <= value <= ranges[left][1]:
        # Return the code
        return ranges[left][2]

I don't have your exact values, but for comparison I ran it against some generated ranges (77 ranges with various codes) and compared it to a naive approach:

def get_code_naive(value):
    if 1000 < value < 2249:
        return 'M'
    if 2250 < value < 2265:
        return 'N'
    # ...

The result for 1,000,000 lookups was that the naive version ran in about 5 sec and the binary search version in 4 sec. So it's a bit faster (20%), the codes are a lot nicer to maintain, and the longer the list gets, the more it will outperform the naive method.

Jon Betts
  • If you are using python3, you should change the `mid` calculation to `mid = left + (right - left) // 2` – rpadovani Nov 15 '19 at 10:19
  • This has to be the best answer algorithmically. Reducing the amount of actual tests, period. How could the "any()" method be any faster? Sure, maybe it runs in native code so it just does tests faster. But your algorithm potentially eliminates many tests. – Sirmabus Apr 14 '23 at 13:26
3

Recently I had a similar requirement and I used bit manipulation to test if an integer belongs to said range. It is definitely faster, but I guess not suitable if your ranges involve huge numbers. I liberally copied example methods from here

First we create a binary number which will have all bits in the range set to 1.

#Sets the bits to one between lower and upper range 
def setRange(permitRange, lower, upper):
  # the range is inclusive of left & right edge. So add 1 upper limit
  bUpper = 1 << (upper + 1)
  bLower = 1 << lower
  mask = bUpper - bLower
  return (permitRange | mask)

#For my case the ranges also include single integers. So added method to set single bits
#Set individual bits to 1
def setBit(permitRange, number):
  mask = 1 << number
  return (permitRange | mask)

Now it's time to parse the ranges and populate our binary mask. If the highest number in the range is n, we will be creating an integer greater than 2^n in binary.

#Example range (10-20, 25, 30-50)
rangeList = "10-20, 25, 30-50"
maxRange = 100
permitRange = 1 << maxRange
for item in rangeList.split(","):
    item = item.strip()  # drop the space after each comma so isdigit() works
    if item.isdigit():
        permitRange = setBit(permitRange, int(item))
    else:
        lower, upper = item.split("-", 1)
        permitRange = setRange(permitRange, int(lower), int(upper))

To check if a number 'n' belongs to a range, simply test the bit at the n'th position.

#return a non-zero result, 2**offset, if the bit at 'offset' is one.
def testBit(permitRange, number):
    mask = 1 << number
    return (permitRange & mask)

if testBit(permitRange,10):
    do_something()
Benny
1

Python has a range(a, b) function which means the range from (and including) a, to (but excluding) b. You can make a list of these ranges and check to see if a number is in any of them. It may be more efficient to use xrange(a, b) which has the same meaning but doesn't actually make a list in memory.

list_of_ranges = []
list_of_ranges.append(xrange(1000, 2430))
list_of_ranges.append(xrange(2545, 2576))
for x in [999, 1000, 2429, 2430, 2544, 2545]:
    result = False
    for r in list_of_ranges:
        if x in r:
            result = True
            break
    print x, result
karmakaze
  • I think that would be a step in the wrong direction. Why turn a O(1) operation of checking if a value is within a range into a O(n) operation? – Jeff Mercado May 19 '11 at 05:21
  • Absolutely, my bad. I thought python could know the start/end of the xrange and do something clever. Alas no, at least not in 2.6. – karmakaze May 19 '11 at 05:32
  • @Jeff, if the code was for Python 3.0 then this becomes a sensible answer (provided you go back to just using `range()`). In Python 3.0 `x in range(lo,hi)` is both O(1) and space efficient. – Duncan May 19 '11 at 09:12
1

Warning - This is probably premature optimisation. For a large list of ranges it might be worthwhile, but probably not in your case. Also, although dictionary/set solutions will use more memory, they are still probably a better choice.

You could do a binary-search into your range end-points. This would be easy if all ranges are non-overlapping, but could still be done (with some tweaks) for overlapping ranges.

Do a find-highest-match-less-than binary search. This is the same as a find-lowest-match-greater-than-or-equal (lower bound) binary search, except that you subtract one from the result.

Use half-open items in your list of end points - that is if your range is 1000..2429 inclusive, use the values 1000 and 2430. If you get an end-point and a start-point with the same value (two ranges touching, so there is no gap between) exclude the end-point for the lower range from your list.

If you find a start-of-range end-point, your goal value is within that range. If you find an end-of-range end-point, your goal value isn't in any range.

The binary search algorithm is roughly (don't expect this to run without editing)...

while upperbound > lowerbound:
    testpos = lowerbound + ((upperbound - lowerbound) // 2)

    if item[testpos] > goal:
        # new best-so-far
        upperbound = testpos
    else:
        lowerbound = testpos + 1

Note - the "//" division operator is necessary for integer division in Python 3. In Python 2, the normal "/" will work, but it's best to be ready for Python 3.

At the end, both upperbound and lowerbound point to the found item - but for the "upper bound" search. Subtract one to get the required search result. If that gives -1, there is no matching range.

There's probably a binary search routine in the library that does the upper-bound search, so prefer that to this if so. To get a better understanding of how the binary search works, see How can I better understand the one-comparison-per-iteration binary search? - no, I'm not above begging for upvotes ;-)
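
For what it's worth, bisect.bisect_right in the standard library does exactly this upper-bound search. A runnable sketch of the half-open end-point scheme described above, assuming non-overlapping ranges with gaps between them (the example ranges are the question's 'M' ranges):

from bisect import bisect_right

ranges = [(1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)]

# half-open end-points: each inclusive range (lo, hi) contributes lo and hi + 1
endpoints = []
for lo, hi in ranges:
    endpoints.extend([lo, hi + 1])

def in_ranges(x):
    i = bisect_right(endpoints, x) - 1  # index of the highest end-point <= x
    # an even index is a start-of-range end-point, so x is inside a range;
    # an odd index (or -1) means x falls in a gap or before all ranges
    return i >= 0 and i % 2 == 0

print(in_ranges(2000))  # True
print(in_ranges(2500))  # False
print(in_ranges(2890))  # True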

Community
  • A binary search might not be so bad. It will definitely cut-down on the time going through the different ranges. – Jeff Mercado May 19 '11 at 06:07
  • @Jeff - not necessarily. Python is very good at optimising simple list/generator comprehensions, and the "any" linear search can be very fast for small numbers of ranges. A binary search is a tad more complex, may not optimise so well, and may not use cache so well. Asymptotically faster solutions are only guaranteed faster for sufficiently large n and the n here isn't that large. –  May 19 '11 at 06:36
0

A bit of a silly approach to an old question, but I was curious how well regex character classes would handle the problem, since this exact problem occurs frequently in questions about character validity.

To make a regex for the "M" postal codes you showed, we can turn the numbers into unicode using chr():

import re

m_ranges = [(1000, 2249), (2545, 2575), (2640, 2686)]
m_singletons = [2890]
m_range_char_class_members = [fr"{chr(low)}-{chr(high)}" for (low, high) in m_ranges]
m_singleton_char_class_members = [fr"{chr(x)}" for x in m_singletons]
m_char_class = f"[{''.join(m_range_char_class_members + m_singleton_char_class_members)}]"
m_regex = re.compile(m_char_class)

Then a very rough benchmark on 1 million random postal codes for this method vs your original if-statement:

import random
import time

test_values = [random.randint(1000, 9999) for _ in range(1000000)]


def is_m_regex(num):
    return m_regex.match(chr(num))

def is_m_if(num):
    return 1000 <= num <= 2249 or 2545 <= num <= 2575 or 2640 <= num <= 2686 or num == 2890


def run_regex_test():
    start_time = time.time()
    for i in test_values:
        is_m_regex(i)
    print("--- REGEX:  %s seconds ---" % (time.time() - start_time))

def run_if_test():
    start_time = time.time()
    for i in test_values:
        is_m_if(i)
    print("--- IF:  %s seconds ---" % (time.time() - start_time))
...
running regex test
--- REGEX:  0.3418138027191162 seconds ---
--- IF:  0.19183707237243652 seconds ---

So this would suggest that for comparing one character at a time, using raw if statements is faster than character classes in regexes. No surprise here, since using regex is a bit silly for this problem.

BUT. When doing an operation like sub to eliminate all matches from a string composed of all the original test values, it ran much quicker:

blob_value = ''.join([chr(x) for x in test_values])

def run_regex_test_char_blob():
    start_time = time.time()
    subbed = m_regex.sub('', blob_value)
    print("--- REGEX BLOB:  %s seconds ---" % (time.time() - start_time))
    print(f"original blob length : {len(blob_value)}")
    print(f"sub length : {len(subbed)}")
...
--- REGEX BLOB:  0.03655815124511719 seconds ---
original blob length : 1000000
sub length : 851928

The sub method here replaces all occurrences of M-postal-characters (~15% of this sample), which means it operated on all 1 million characters of the string. That would suggest to me that mass operations by the re package are MUCH more efficient than the individual operations suggested in these answers. So if you've really got a lot of comparisons to do at once in a data pipeline, you may find the best performance by doing some string composition and using regex.

Indigenuity
0

In Python 3.2, functools.lru_cache was introduced. Your solution, along with the aforementioned decorator, should be pretty fast. Or, Python 3.9's functools.cache could be used as well (which should be even faster).
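
A sketch of what that looks like, with the decorated function standing in for the question's if-tree:

from functools import lru_cache

@lru_cache(maxsize=None)  # or @functools.cache on Python 3.9+
def get_code(postcode):
    # stand-in for the full 27-code if-tree
    if (1000 <= postcode <= 2429 or 2545 <= postcode <= 2575
            or 2640 <= postcode <= 2686 or postcode == 2890):
        return 'M'
    # ... remaining codes ...
    return None

Note that caching only pays off when the same postcodes recur; with only ~9,000 possible four-digit postcodes the cache saturates quickly, after which every lookup is a single dict hit.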

Pirulax
0

Another solution is based on looking at the limits of the data set. When possible, you build a table that maps the values, reduced by a common divisor, to a quick lookup index, sort of like a hash table.
The idea is borrowed from the tricks OS kernel developers use for quick address range checks on memory ranges, which are usually multiples of 4096 (the page size).

Over the OP's original set of values: (1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)
If you divide these by 100 you will get a set of buckets: (10 to 24), 25, 26, and 28

We generate two tables then:

ranges = ((1000,2429), (2545,2575), (2640,2686), (2890, 2890))
buckets = {10:0,11:0,12:0,13:0,14:0,15:0,16:0,17:0,18:0,19:0,20:0,21:0,22:0,23:0,24:0, 25:1, 26:2, 28:3}

Our test function:

def find_m(number: int):
    # In a bucket?
    index = buckets.get(number // 100)
    if index is not None:
        # Yes, is it in range?
        if ranges[index][0] <= number <= ranges[index][1]:
            # Yes
            return 'M'
    return None

Let's test it:

print(find_m(1500))    
print(find_m(2600))
print(find_m(2890))
print(find_m(1))

The (correct) output:

M
None
M
None

This might be an optimal solution speed-wise, since we reduce the range check to just a divide, a dictionary lookup, and a verifying range check. That is about as fast as Python's own internal native-code dictionary hash table lookup.

Extra points:

  • Would probably want to turn this into a container class. Generate the "buckets" dictionary procedurally to avoid doing it by hand (like I did here) and to fit the general use case; see the sketch after this list.
  • Depending on the optimal divisor/split value, you might end up with collisions. So, like a traditional hash table, we would need to expand the "ranges" list into a list of lists (tuples, technically) and test the input number against the matching index set. If there are many collisions there might be room for further range-testing optimizations (as the proverb below dictates).
  • Like any data container, it might not be optimal for every use case. The OP didn't specify anything other than the fact that his example is hard-coded numbers. So insertions or deletions shouldn't matter here (as the proverb dictates), but could matter for other cases.
  • I feel like I'm missing something here mathematically. I believe it would be possible to take the remainder of the divide by 100 and do a shift to generate a 32-bit hash over 100 buckets, or something, to become more of a hash table directly.
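
A possible sketch of that procedural generation (assuming at most one range per hundred-block, as in the example data; collisions would need the list-of-lists extension described above):

def build_buckets(ranges, divisor=100):
    # map every hundreds-bucket a range touches to that range's index
    buckets = {}
    for index, (low, high) in enumerate(ranges):
        for bucket in range(low // divisor, high // divisor + 1):
            buckets[bucket] = index
    return buckets

ranges = ((1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890))
buckets = build_buckets(ranges)  # reproduces the hand-written dict above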

Proverb: "know thy data"

Sirmabus