0

I'm looking for the fastest way to do the following: given a dictionary and a key value, return the lowest key in the dictionary greater than than the value given. Per this question, the natural way would seem to be to create an OrderedDict, then use bisect on the keys to find the proper key location. The OrderedDict.keys() method doesn't support indexing, so per e.g. this question, one has to convert the keys to a list, before doing bisect or similar.

So once an OrderedDict has been created with its keys in order, in order to access the correct position one has to do the following:

  1. Convert the keys to a list
  2. Do a binary search of the keys with bisect or similar.
  3. Check that this insertion point isn't at the end of the list, before retrieving the key located after this index.
  4. Retrieve the key value in our original OrderedDict.

I'm most concerned about step 1 above, from an efficiency perspective (although all of this looks roundabout to me). Without knowing the details of how Python does the conversion to list, it seems like it would have to be O(n), completely eliminating the savings of using OrderedDict and binary search. I'm hoping someone can tell me whether this assumption I have about step 1 is or isn't correct, and regardless whether or not there may be a better method.

As an alternative, I could simply create a list of tuples, sorted by the first element (key), where the second element is the dict value associated with that key. Then I could pass the key lambda x:x[0] to bisect. This seems reasonable, but I'd prefer to store my key / value pairs more uniformly (e.g. JSON), since that's how it's done with other dicts in the same project that don't need this specific type of comparison.

Here's some example code for a single lookup. Edit: But lest anyone think I'm over-optimizing, the actual dictionary has ~3 million keys, and will be accessed ~7 million times in a batch, daily. So I'm very interested in finding a fast way of doing this.

# Single lookup example
from collections import OrderedDict
from bisect import bisect

d = OrderedDict()
d[5] = 'lowest_value'
d[7] = 'middle_value'
d[12] = 'highest_value'

sample_key = 6    # we want to find the value for the key above this in d, e.g. d[7]

list_of_keys = list(d.keys())
key_insertion_index = bisect(list_of_keys,sample_key)

if key_insertion_index < len(list_of_keys):
    next_higher_key = list_of_keys[key_insertion_index]
    v = d[next_higher_key]
Eli Johnson
  • 349
  • 2
  • 10
  • It hardly ever makes sense to optimize a single lookup. You need to be doing lots of them to see any improvement from using tricky techniques. – Mark Ransom Dec 06 '22 at 20:29
  • An ordered dictionary orders the keys in accordance to the order in which the key is defined, not the value. Since you seem to want keys sorted by value, why not just use a standard dictionary and sort the keys using something like sorted(list(d.keys()))? – itprorh66 Dec 06 '22 at 20:57
  • @MarkRansom I am doing lots of lookups, I just wrote the sample code with a single lookup in mind – Eli Johnson Dec 06 '22 at 21:55
  • @itprorh66 It's easy for me to sort them simultaneously by value and by definition, since they are being written to the dictionary from a pyodbc cursor that is already sorted. As a result, using a standard dictionary and then sorting the keys later would basically throw away all the sorting work that has already happened. – Eli Johnson Dec 06 '22 at 22:01
  • 2
    Note that since Python 3.6 there's no practical difference between `OrderedDict` and `dict`: https://stackoverflow.com/a/60775590/5987 – Mark Ransom Dec 06 '22 at 22:58
  • @MarkRansom Thanks, that's helpful, since it means I can use a regular dict. It still leaves open the question of how to avoid converting to list, since regular dict keys also don't support indexing. – Eli Johnson Dec 06 '22 at 23:05
  • 1
    I don't think you need to worry about converting to list. The object returned by `keys()` may not support indexing, but it does support `len()` so the list should be preallocated to the proper size - the copy should be relatively fast. – Mark Ransom Dec 06 '22 at 23:14
  • @MarkRansom Sorry, but I fail to see how a copy could be less than O(n) (in which case we could just iterate through the keys instead of attempting to do a binary search). If the binary search could be performed directly on the keys it would be O(log n). – Eli Johnson Dec 06 '22 at 23:23
  • 1
    If you need to do a lot of such bisects, maybe you could store the sorted list of keys in a list that you keep alongside your dict? – joanis Dec 06 '22 at 23:26
  • @joanis That's a thought. The dict is static, so it shouldn't add any overhead costs. My only aversion is it seems a bit clunky. But honestly that seems like it could be the way to go. – Eli Johnson Dec 06 '22 at 23:29
  • Actually, I might even create a class that contains that list and that dict together, to create clear encapsulation that the two things belong together. Then you can make your class design as clean as you want. – joanis Dec 06 '22 at 23:30
  • *“As an alternative, I could simply create a list of tuples, sorted by the first element (key), where the second element is the dict value associated with that key. Then I could pass the key `lambda x:x[0]` to bisect.”* Yes, do that. You have a specific reason not to use the same data structure as the others, so…. Also worth benchmarking two separate lists (one of keys, one of values) since bisect gives you an index. – Ry- Dec 07 '22 at 00:03
  • 1
    Yes, a copy will be O(n) - but only once. If you can reuse that copy to do many O(log n) searches, the overhead will matter much less. It will certainly be better overall than doing the same number of O(n) searches. – Mark Ransom Dec 07 '22 at 02:50

0 Answers0