
I frequently need to check whether a key is in a dictionary, do something with the value if it is, and do something else if it is not. In Python, there are two clear ways to do this:

if key in dict_:
    value = dict_[key]
    do_something
else:
    do_something_else

or

try:
    value = dict_[key]
except KeyError:
    do_something_else
else:
    do_something

Which of these is faster/preferable? Does it depend on the dictionary size?

It seems there may be two competing effects here: 1) having to hash and look up the key twice, vs. 2) the overhead of raising and handling an exception.

Luke Davis
  • Hash map lookups are O(1). If you don't want to raise an exception I would go with the first. If you are really concerned about the time, I think a `defaultdict` from the `collections` module could help you out. – ericstevens26101 Apr 28 '20 at 03:53
  • 1
    @ericstevens26101 `defaultdict` probably is not what you want if you are hoping to `do_something_else` when the value is not in the dictionary. – Mark Apr 28 '20 at 03:55
  • Also, `dict`s in python are no longer just hashmaps ;). They are b-trees-ish for performance and ordering reasons. The theoretical complexity and practical complexity are going to be different here. – modesitt Apr 28 '20 at 03:56
  • 1
    @modesitt no, absolutely not, they are (fancy) hashmaps. They are absolutely not b-trees. – juanpa.arrivillaga Apr 28 '20 at 04:00
  • That was what I recalled was used for order preservation over the table - the "fancy" part. [Here](https://mail.python.org/pipermail/python-dev/2012-December/123028.html) is the tl;dr of the implementation. This is why I said "no longer just" @juanpa.arrivillaga. I did not say "they are not"... – modesitt Apr 28 '20 at 04:02
  • 1
    They are hash tables + links for ordering.. access is still O(1). A B-tree implementation would change that and break expectations. – user2864740 Apr 28 '20 at 04:03
  • @modesitt yes, I know, but that doesn't describe a B-tree. In any case, B-trees keep items in *sorted* order, not in insertion order (I suppose, you *could* use some sort of sorting key which is just the insertion order...). You can read a high-level overview directly in the [source code](https://github.com/python/cpython/blob/master/Objects/dictobject.c), note, the first line is a comment `/* Dictionary object implementation using a hash table */` – juanpa.arrivillaga Apr 28 '20 at 04:05
  • I just mis-thought that nodes contained the keys' values *and* that indexes were used to order nodes. I thought a b-tree was used on top of the map for keys, as a real-time sort on insert. My main point was that big-O here is less meaningful than a performance benchmark: the overhead of `try/except`, the implementation of `.get`, etc. are what **actually** matter for this practical question, not a non-contributing conversation about `O(...)`. – modesitt Apr 28 '20 at 04:07
  • @MarkMeyer Depending on the use case, a `defaultdict` would return something falsy for a missing key, so an `if` statement like `if d['bad_key']: do_something` / `else: do_something_else` could take the `do_something_else` branch without looking the key up a second time in a large dict, per the discussion above (see the sketch below). – ericstevens26101 Apr 28 '20 at 04:08
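
A minimal sketch of the `defaultdict` idea from the comment above (hypothetical data; it assumes a falsy default such as `None`, and note the caveat that a lookup of a missing key also inserts the default into the dictionary):

from collections import defaultdict

# Missing keys yield None (falsy) instead of raising KeyError.
d = defaultdict(lambda: None, {"a": 1, "b": 2})

value = d["missing"]                   # no KeyError; value is None
if value is not None:
    print("do_something with", value)
else:
    print("do_something_else")
# Caveat: the lookup above inserted "missing" -> None into d,
# which is why the comments note it may not fit this use case.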

1 Answer


You can benchmark three different methods with `timeit`: `get1` uses `try`/`except`, `get2` uses the built-in `.get`, and `get3` checks membership with `in` first.

In [1]: def get1(d, k): 
......:     try: 
......:         return d[k] 
......:     except KeyError: 
......:         return None 

In [3]: def get2(d, k): 
......:     return d.get(k) 

In [4]: def get3(d, k):
......:     if k in d: return d[k]
......:     else: return None
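
The dictionaries and key sample themselves are not shown in the session; a plausible reconstruction (integer keys are an assumption on my part, and the `choices` line is the one quoted further down) would be:

little_d = {i: i for i in range(100)}        # "small" dictionary, 100 elements
big_d = {i: i for i in range(1_000_000)}     # "big" dictionary, 1M elements

import random
# ~50% of these keys are missing from big_d; used in the "misses" test below
choices = random.choices(range(0, 2 * len(big_d)), k=len(big_d))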

On a small dictionary (100 elements):

In [8]: %timeit -n 100 [get1(little_d, e) for e in range(len(little_d))]                
18.8 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit -n 100 [get2(little_d, e) for e in range(len(little_d))]                
22.5 µs ± 352 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit -n 100 [get3(little_d, e) for e in range(len(little_d))]               
19.3 µs ± 862 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

And on a bigger one (1M elements), looking up the same 100 keys so the number of lookups stays constant:

In [11]: %timeit -n 100 [get1(big_d, e) for e in range(len(little_d))]
19.4 µs ± 469 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit -n 100 [get2(big_d, e) for e in range(len(little_d))]
21.8 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit -n 100 [get3(big_d, e) for e in range(len(little_d))]
19.2 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

On the (1M) dictionary with ~50% "misses" (`choices = random.choices(range(0, 2*len(big_d)), k=len(big_d))`), the differences are larger:

In [20]: %timeit -n 100 [get1(big_d, e) for e in choices]                              
514 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: %timeit -n 100 [get2(big_d, e) for e in choices]                              
416 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: %timeit -n 100 [get3(big_d, e) for e in choices]                              
367 ms ± 4.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Calling `.get` directly, without the wrapper function, is faster still in this case:

In [23]: %timeit -n 100 [big_d.get(e) for e in choices]                                
334 ms ± 3.6 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

The three approaches seem close and are equally impacted by the size of the dictionary, but they are affected differently by misses. The only real impact size will have is when Python begins thrashing memory (as it probably does here). Note that `get2` is slower simply because of the overhead of two function calls (`get2` itself and `.get`), and function calls are relatively expensive in Python, as the last test shows. Misses produce slower results for a variety of reasons, as @user2864740 points out.

tl;dr

I would use .get.
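
If `None` can itself be a stored value, a sentinel keeps the single-lookup `.get` idiom unambiguous. A small sketch with hypothetical data:

_MISSING = object()                    # unique sentinel, distinct from any stored value

d = {"a": 1, "b": None}                # note: None here is a legitimate stored value
value = d.get("b", _MISSING)
if value is not _MISSING:
    print("do_something with", value)  # runs: the key exists, even though value is None
else:
    print("do_something_else")         # would run only for truly missing keys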


The answer will also depend heavily on the speed of the `__hash__` and `__eq__` implementations of your keys. If `__hash__` and `__eq__` are slow, the cost of hashing the key twice may start to show, making method 1 or 2 (used inline, without the extra wrapper function) the better choice, since they hash the key only once.
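
To illustrate that last point, here is a hypothetical key type with a deliberately slow `__hash__`: the check-then-index approach hashes the key twice, while `try`/`except` and `.get` hash it only once.

import time

class SlowKey:
    """Key with a deliberately expensive __hash__ (illustration only)."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        time.sleep(0.001)              # simulate an expensive hash
        return hash(self.value)
    def __eq__(self, other):
        return isinstance(other, SlowKey) and self.value == other.value

d = {SlowKey(1): "one"}
k = SlowKey(1)

value = d[k] if k in d else None       # hashes k twice (once for "in", once for the index)
value = d.get(k)                       # hashes k once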

modesitt
  • 1
    Does the behavior change on “mostly misses” (ie. lots of errors raised)? I would expect that it does due to stack unwinding and additional overhead. – user2864740 Apr 28 '20 at 04:09
  • 1
    hmmm, method 3 does not return the value, just the dict itself. should be `return d[k]`. whats the speed then? – JL Peyret Apr 28 '20 at 04:28
  • Nice catch @JLPeyret, I updated the timing results. The ordering happened to stay the same though (despite the typo invalidating `get3`). – modesitt Apr 28 '20 at 04:32