10

I know that I can hash singular values as keys in a dict. For example, I can hash 5 as one of the keys in a dict.

I am currently facing a problem that requires me to hash a range of values.

Basically, I need a faster way to do this:

if 0 <= x <= 0.1:
    # f(A)
elif 0.1 <= x <= 0.2:
    # f(B)
elif 0.2 <= x <= 0.3:
    # f(C)
elif 0.3 <= x <= 0.4:
    # f(D)
elif 0.4 <= x <= 0.5:
    # f(E)
elif 0.5 <= x <= 0.6:
    # f(F)

where x is some float parameter of arbitrary precision.

The fastest way I can think of is hashing, but here's the problem: I can use (0.1, 0.2) as a key, but that is still going to cost me O(n) runtime and is ultimately no better than the slew of elifs (I would have to iterate over the keys and check whether key[0] <= x <= key[1]).

Is there a way to hash a range of values so that I can check the hash table for 0.15 and still get f(B)?

If such a hashing isn't possible, how else might I be able to improve the runtime of this? I am working with large enough data sets that linear runtime is not fast enough.

EDIT: In response to cheeken's answer, I must note that the intervals cannot be assumed to be regular. As a matter of fact, I can almost guarantee that they are not.

In response to requests in comments, I should mention that I am doing this in an attempt to implement fitness-based selection in a genetic algorithm. The algorithm itself is for homework, but the specific implementation is only to improve the runtime for generating experimental data.

inspectorG4dget
  • Note that after the first `if`, the `0.n <=` part of the latter `elif`s is logically redundant. This would make the code faster. – Dan D. Jan 28 '12 at 05:35
  • Do you have a large number of intervals? – Moox Jan 28 '12 at 05:44
  • Could you explain a little more about what you're doing and why? I find it hard to believe that the branching speed here is dominant unless A through F do almost nothing, in which case I'd have a different set of recommendations. – DSM Jan 28 '12 at 05:45
  • Are all the ranges guaranteed to be contiguous when put together, or could it be `.1 < x < .2` and then skip to `.3 < x < .4` with nothing for `.2 < x < .3`? – Chris Lutz Jan 28 '12 at 05:56
  • @ChrisLutz: The intervals are guaranteed to be contiguous – inspectorG4dget Jan 28 '12 at 06:02
  • How many ranges are we talking about, here? If we're talking about hundreds and thousands of random ranges, a binary search (with bins distributed over the discrete ranges, not the values of the ranges) may be your best bet. – cheeken Jan 28 '12 at 06:05
  • @cheeken: I'm working with `10**3` to `10**4` ranges – inspectorG4dget Jan 28 '12 at 06:06
  • 1
  • If you're really concerned with speed, I would consider writing that part of the code in C or assembly and then referencing it with Python. – Joel Cornett Jan 28 '12 at 06:15
  • You are going to write 1000 to 10000 different functions?? – John Machin Jan 28 '12 at 06:32
  • Nono... it's the same function, but with different parameters. The parameters are the values of the `dict`. I should have made that more clear. I have edited the question to reflect this now – inspectorG4dget Jan 28 '12 at 20:11

4 Answers

12

As others have noted, the best you're going to get for this is O(log N), not O(1): something along the lines of a bisection search through a sorted list.

The easiest way to do this in Python is with the bisect standard module, http://docs.python.org/library/bisect.html. Note, in particular, the example in section 8.5.2 there, on doing numeric table lookups -- it's exactly what you are doing:

>>> from bisect import bisect
>>> def grade(score, breakpoints=[60, 70, 80, 90], grades='FDCBA'):
...     i = bisect(breakpoints, score)
...     return grades[i]
...
>>> [grade(score) for score in [33, 99, 77, 70, 89, 90, 100]]
['F', 'A', 'C', 'C', 'B', 'A', 'A']

Replace the grades string with a list of functions, the breakpoints list with your list of lower thresholds, and there you go.
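
For instance, a minimal sketch of that adaptation might look like the following; the handler functions and breakpoints here are made-up placeholders, not taken from the question:

from bisect import bisect  # bisect is an alias for bisect_right

# Hypothetical handlers standing in for f(A) ... f(F)
def handle_a(x): return 'A', x
def handle_b(x): return 'B', x
def handle_c(x): return 'C', x
def handle_d(x): return 'D', x
def handle_e(x): return 'E', x
def handle_f(x): return 'F', x

# Interior boundaries (the lower threshold of every interval except the first)
breakpoints = [0.1, 0.2, 0.3, 0.4, 0.5]
handlers = [handle_a, handle_b, handle_c, handle_d, handle_e, handle_f]

def dispatch(x):
    # bisect_right returns how many breakpoints are <= x, which is the
    # index of the interval containing x. An exact boundary value such
    # as 0.1 lands in the higher interval here; use bisect_left to flip
    # that convention. Values above the last breakpoint fall into the
    # final handler in this sketch.
    return handlers[bisect(breakpoints, x)](x)

print(dispatch(0.15))  # ('B', 0.15)
print(dispatch(0.55))  # ('F', 0.55)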

Brooks Moses
4

You don't necessarily need to hash an entire range of values. For example, with the scale given above, if you were given 0.15, you could round it up to 0.2 (rounding up at the first digit after the decimal point) and hash 0.2 instead.

How efficient does this have to be? An alternative you can try is a binary search. Have the interval values stored in sorted order in a list, and do a binary search on it. For example:

sorted_list = [(0.1, function1), (0.2, function2), ..., (0.6, function6)]

Then you simply do a binary search to find the smallest threshold that is bigger than x. This will yield O(log n).
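
A rough sketch of that idea, using the standard bisect module instead of a hand-written search (the handlers below are placeholders):

from bisect import bisect_left

# Placeholder handlers standing in for function1 ... function6
sorted_list = [(0.1, lambda x: 'f1'), (0.2, lambda x: 'f2'),
               (0.3, lambda x: 'f3'), (0.4, lambda x: 'f4'),
               (0.5, lambda x: 'f5'), (0.6, lambda x: 'f6')]

# Keep the thresholds in a parallel list so bisect can search them directly
bounds = [upper for upper, _ in sorted_list]

def lookup(x):
    # Index of the smallest threshold that is >= x
    i = bisect_left(bounds, x)
    if i == len(bounds):
        raise ValueError("x is above the last interval")
    return sorted_list[i][1](x)

print(lookup(0.15))  # 'f2'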

xkrz
  • I like the O(log n) approach. If it turns out that there's no O(1) solution, then this is what I will go with. – inspectorG4dget Jan 28 '12 at 06:04
  • You could also combine the binary tree with the ideas of what to do if the intervals are regular -- divide the range into regular bins of size 0.1 (or even 0.01?) and then in each bin store a list of the intervals that cross that bin. It's still O(log n) for large n, but the multiplicative constant is a lot lower. – Brooks Moses Jan 28 '12 at 06:05
  • 1
  • @inspectorG4dget: There is, indeed, no O(1) solution. – Brooks Moses Jan 28 '12 at 06:09
  • @Moses, exactly what I did in a fitness based selection algorithm written in C. This approach is on average O(Number of intervals) + O(1) * (Number of searches), especially if the numbers to score are drawn randomly. In the worst case it is O(Number of intervals) + O(log N) * (Number of searches). – jimifiki Jan 28 '12 at 07:45
  • @inspectorG4dget: I think one should start identifying O(1) with O(log n). Even hashing depends on the length of what you hash (i.e. log_n of how many things of that size are there) – jimifiki Jan 28 '12 at 07:45
3

If your intervals are regular, you can scale and then floor your operands to the minimum of each range, then pass that result directly into a dict mapping those lower bounds to the appropriate handler.

An example implementation, using the ranges you provided:

# Integerize our 0.1-wide intervals by scaling by 10
handlerDict = {}
handlerDict[0] = lambda x: ... # handles 0.0 <= x < 0.1
handlerDict[1] = lambda x: ... # handles 0.1 <= x < 0.2
handlerDict[2] = lambda x: ... # handles 0.2 <= x < 0.3
...

# Look up the right handler by scaling x by 10 and truncating, then call it
handlerDict[int(10*x)](x, ...)
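
A runnable variant of the same idea, assuming the intervals really are exactly 0.1 wide and starting at 0.0; the handlers are placeholders:

# Bucket index -> handler; each bucket is 0.1 wide, covering 0.0 up to 0.6
handler_dict = {
    0: lambda x: ('A', x),  # 0.0 <= x < 0.1
    1: lambda x: ('B', x),  # 0.1 <= x < 0.2
    2: lambda x: ('C', x),  # 0.2 <= x < 0.3
    3: lambda x: ('D', x),  # 0.3 <= x < 0.4
    4: lambda x: ('E', x),  # 0.4 <= x < 0.5
    5: lambda x: ('F', x),  # 0.5 <= x < 0.6
}

def handle(x):
    # Scale by 10 and truncate to get the bucket index in O(1).
    # Beware floating-point rounding for values sitting exactly on a boundary.
    return handler_dict[int(10 * x)](x)

print(handle(0.15))  # ('B', 0.15)
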
cheeken
3

In order to improve the runtime, you could implement a bisection search.

Otherwise, you could put the interval thresholds in a trie.

EDIT: Let me propose an implementation:

class IntervalHash:
    def __init__(self, SortedList):
        # (one could check here that SortedList really is sorted)
        self.MyList = []
        self.MyList.extend(SortedList)
        self.lenlist = len(self.MyList)

    def get_interval(self, a):
        # Bisection over the stored thresholds
        mylen = self.lenlist
        mypos = 0
        while mylen > 1:
            mylen = mylen // 2 + mylen % 2  # halve the window, rounding up
            if mypos + mylen > self.lenlist - 1:
                if self.MyList[self.lenlist - 1] < a:
                    mypos = self.lenlist - 1
                break
            if self.MyList[mypos + mylen] < a:
                mypos += mylen
        if mypos == 0:
            if self.MyList[0] > a:
                return ("-infty", self.MyList[0], 0)
        if mypos == self.lenlist - 1:
            return (self.MyList[mypos], "infty", 0)
        return (self.MyList[mypos], self.MyList[mypos + 1], 0)

A = [0.32, 0.70, 1.13]
MyHasher = IntervalHash(A)
print("Intervals are:", A)
print(0.9, "is in", MyHasher.get_interval(0.9))
print(0.1, "is in", MyHasher.get_interval(0.1))
print(1.8, "is in", MyHasher.get_interval(1.8))

Further edits and improvements are welcome! The trie approach is much more involved; in my opinion it would be more appropriate for low-level languages.

jimifiki
  • I like the O(log n) approach. If it turns out that there's no O(1) solution, then this is what I will go with. – inspectorG4dget Jan 28 '12 at 06:03
  • A tree structure of some sort is almost certainly better than a sorted list, yes. (However, I don't think a trie is a particularly good choice of tree -- can you elaborate on how exactly that would work?) – Brooks Moses Jan 28 '12 at 06:13
  • Excellent addition of an example; upvoted! And I agree with your note that this is more appropriate for a lower-level language (whereas Python effectively has this as a built-in module already.) – Brooks Moses Jan 28 '12 at 06:35