Is the defaultdict in Python's collections module really faster than using setdefault?

Question

I've seen other Python programmers use defaultdict from the collections module for the following use case:

from collections import defaultdict

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

def main():
    d = defaultdict(list)
    for k, v in s:
        d[k].append(v)

I've typically approached this problem by using setdefault instead:

def main():
    d = {}
    for k, v in s:
        d.setdefault(k, []).append(v)

The docs do in fact claim that using defaultdict is faster, but I've seen the opposite to be true when testing myself:

$ python -mtimeit -s "from withsetdefault import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 4.51 usec per loop
$ python -mtimeit -s "from withdefaultdict import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 5.38 usec per loop

Is there something wrong with how I've set up the tests?

For reference, I'm using Python 2.7.3 [GCC 4.2.1 (Apple Inc. build 5666)

did one test file contain `s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]` where the other didn't? — Claudiu, Sep 23 '12 at 20:28
You need to use *much* larger test sets where there is more than 2 values per key. — Martijn Pieters, Sep 23 '12 at 20:32
@MartijnPieters: i did. i don't know what is in `withsetdefault.py` and `withdefaultdict.py`. if one of them contains that line and the other doesn't, that could account for the difference — Claudiu, Sep 23 '12 at 20:33

score 25 · Accepted Answer · answered Sep 23 '12 at 20:38

Yes, there is something "wrong":

You have put the creation of the (default)dict into the statement instead of the setup. Constructing a new defaultdict is more expensive than a normal dict, and usually that's not the bottleneck you should be profiling in a program - after all, you build your data structures once but you use them many times.

If you do your tests like below, you see that defaultdict operations are indeed faster:

>>> import timeit
>>> setup1 = """from collections import defaultdict
... s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = defaultdict(list)"""
>>> stmt1 = """for k, v in s:
...     d[k].append(v)"""
>>> setup2 = """s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = {}"""
>>> stmt2 = """for k, v in s:
...     d.setdefault(k, []).append(v)"""
>>> timeit.timeit(setup=setup1, stmt=stmt1)
1.0283400125194078
>>> timeit.timeit(setup=setup2, stmt=stmt2)
1.7767367580925395

Python 2.7.3 on Win7 x64.

Thanks Tim! When I remove the defaultdict constructor from the test statement, it's definitely faster that using setdefault...regardless of the data size. — damzam, Sep 23 '12 at 21:24
You only talk about constructing the (default)dict being the difference, but I'd say you also have a second difference that might be significant: In the original, 60% of all keys were new and required an insertion and used the `[]`. In your version, due to repeating each snippet a million times, it's almost 0%. — superb rain, Nov 15 '20 at 13:16

Is the defaultdict in Python's collections module really faster than using setdefault?

1 Answers1

Linked