0

I have a dictionary in some code which maps a key to a word, the key is the result of an md5 hash. I have code that essentially wants to get the key for a word, and when it doesn't already exist, add it to the dictionary

Here was my first implementation:

key = int(hashlib.md5(word).hexdigest(), 16)
if key in self.id_to_word.keys():
    assert word == self.id_to_word[key]
else:
    self.id_to_word[key] = word
return key

After profiling my code I found this to be EXTREMELY slow. So then I tried this, which is functionally equivalent

key = int(hashlib.md5(word).hexdigest(), 16)

try:
    assert word == self.id_to_word[key]
    return key
except KeyError:
    self.id_to_word[key] = word

This turned out to be incredibly faster. While I'm certainly happy about the performance improvement, I was wondering if someone could explain to me why. Is it bad practice to check for something in a keys() function from a dictionary like that? Is it generating copies of that every time (wasting a lot of computation)?

sedavidw
  • 11,116
  • 13
  • 61
  • 95

5 Answers5

3

id_to_word.keys() creates a new list, which is linearly searched, which is much slower than a hash lookup. Remove the .keys().

The fastest way would be:

key = int(hashlib.md5(word).hexdigest(), 16)
assert word == self.id_to_word.setdefault(key, word)
Daniel
  • 42,087
  • 4
  • 55
  • 81
  • The only issue with this is that `assert` should not be used in production code. `python -O` (the optimize flag) will remove an assert from code. At that time, your code breaks. – Deacon Apr 13 '15 at 18:08
2

This is to be expected (in python2). The keys() method returns a list of keys. So using the in operator on the list takes linear time.

Trying to access the item is constant time, which is much faster.

Note: you can simply use key in dictionary instead of the try: ...except:.


Note that dictionaries have a setdefault method that already does what you want. Moreover if you do that operation a lot of time you should consider using collections.defaultdict instead of a plain dictionary.

Bakuriu
  • 98,325
  • 22
  • 197
  • 231
2

key in some_dict is much faster than key in some_dict.keys()

Dict Lookup key in some_dict is O(1) complexity so its very fast

that said its still (very marginally)slower in the case where the key is in the dict than just try/except

the real answer is there is no real measurable difference between these 2 methods and do whatever feels right to you

Joran Beasley
  • 110,522
  • 12
  • 160
  • 179
1

Elaborating on my comments above:

In [4]: d = {k:k for k in xrange(1000)}
In [5]: %timeit 50 in d
10000000 loops, best of 3: 68.6 ns per loop

In [6]: %timeit 50 in d.keys()
100000 loops, best of 3: 6.35 µs per loop

as you can see, using d.keys() is about 100 times slower.

reptilicus
  • 10,290
  • 6
  • 55
  • 79
0

As mentioned setdefault() solves your problem without an if or try block.
But "Easier to Ask for Forgiveness than Permission" [EAFP] and duck-typing is a common idiom in Python, compared to the more defensive "Look Before You Leap" [LBYL] idiom common in other languages, e.g. Java, C++.

Jeff Knuth has an interesting blog post on it Write Clean Python: Use Exceptions

AChampion
  • 29,683
  • 4
  • 59
  • 75