Why is the time taken by collections.Counter to construct a frequency map less than a simple dict creation by a for loop?

Question

From what I understand, since count() iterates over the list again for each element it is slower. Since, the Counter and dict construction both iterate over the list once shouldn't the dict construction and counter time results be similar?

I used the https://stackoverflow.com/a/23909767/7125235 as code reference for getting the time values.


import timeit

if __name__ == "__main__":
    code_block = """seen = {}
for i in l:
    if seen.get(i):
        seen[i] += 1
    else:
        seen[i] = 1
"""
    setup = """import random
import string
from collections import Counter
n=1000
l=[random.choice(string.ascii_letters) for x in range(n)]
"""

    t1 = timeit.Timer(
        stmt="Counter(l)",
        setup=setup,
    )

    t2 = timeit.Timer(
        stmt="[[x,l.count(x)] for x in set(l)]",
        setup=setup,
    )
    t3 = timeit.Timer(
        stmt=code_block,
        setup=setup,
    )

    print("Counter(): ", t1.repeat(repeat=3, number=10000))
    print("count():   ", t2.repeat(repeat=3, number=10000))
    print("seen{}:   ", t3.repeat(repeat=3, number=10000))

Output:

Run1:

Counter():  [0.32974308, 0.319977907, 0.301750341]
count():    [6.424047524000001, 6.417152854, 6.450776530000001]
seen{}:    [1.1089669810000018, 1.099655232, 1.116015376]

Run 2:

Counter():  [0.322483783, 0.32464020800000004, 0.33498838900000005]
count():    [6.3235339029999995, 6.48233445, 6.543396192000001]
seen{}:    [1.1192663550000006, 1.1072084830000009, 1.1155270229999985]

Please be aware that ``timeit`` can also work with regular functions. There is no need to cram relevant parts of the question into such illegible strings. — MisterMiyagi, Jan 10 '21 at 18:35
@MisterMiyagi Will take care in the future. I used the code snippet from this answer https://stackoverflow.com/a/23909767/7125235 — sparkstar, Jan 10 '21 at 18:38
check your code. You first make a `set` out of the input list `l` in the `for` loop (effectively reducing the size). Your `dict` solution will not yield same result as `Counter` solution. Also I would run 3 alternatives with same input list. — buran, Jan 10 '21 at 18:40
Can you clarify your "new" question? The entire point of ``Counter`` is to count things. Why are you surprised that it is fast when counting thing? — MisterMiyagi, Jan 10 '21 at 18:56

DeepSpace · Accepted Answer · 2021-01-10T19:04:44.790

TL;DR

Counter.__init__ is using a C loop (in CPython, at least) to count the iterable's elements, see https://github.com/python/cpython/blob/6b1ac809b9718a369aea67b99077cdd682be2238/Modules/_collectionsmodule.c#L2277.

Counter is (mostly, see below) implemented in Python, meaning its code can be inspected, debugged and even modified very easily.

Counter.__init__ in CPython 3.8 (docstring not included):

def __init__(self, iterable=None, /, **kwds):
    super(Counter, self).__init__()
    self.update(iterable, **kwds)

and Counter.update (irrelevant paths not included):

def update(self, iterable=None, /, **kwds):
    if iterable is not None:
        if isinstance(iterable, _collections_abc.Mapping):
            ...
        else:
            _count_elements(self, iterable)
    if kwds:
        ...

and _count_elements:

def _count_elements(mapping, iterable):
    mapping_get = mapping.get
    for elem in iterable:
        mapping[elem] = mapping_get(elem, 0) + 1

However, there is a very important piece of code + comment right below the implementation of _count_elements:

try:  # Load C helper function if available
    from _collections import _count_elements
except ImportError:
    pass

In other words, Counter uses a C loop to count the elements, which is inherently faster than the Python loop you are using.

We can do a small experiment. Comment out the code that imports the C function:

# try:   # Load C helper function if available
#     from _collections import _count_elements
# except ImportError:
#     pass

Executing your code:

Counter():  [1.8369901, 1.8359803000000001, 1.940804]
seen{}:    [1.2392313000000001, 1.2483893999999998, 1.3157528000000003]

Now Counter is actually slower than a plain dict.

score 0 · Answer 2 · answered Jan 10 '21 at 18:49

I corrected your code to perform comparable operations on all three sections: import the same packages (to balance the overhead), and then search the entire list of characters, rather than reducing it to a unique string -- that was your main error in the "seen" code: you counted only one of each character.

import timeit

if __name__ == '__main__':
    code_block = '''seen = {}
for i in char:
    if seen.get(i):
        seen[i] += 1
    else:
        seen[i] = 1
'''

    common = 'import random;import string;from collections import Counter;n=1000;' + \
             'char=[random.choice(string.ascii_letters) for x in range(n)]'
    t1 = timeit.Timer(stmt='Counter(char)',
                      setup=common)

    t2 = timeit.Timer(stmt='[[x,char.count(x)] for x in set(char)]',
                      setup=common)
    t3 = timeit.Timer(stmt=code_block,
                      setup=common)

    print("Counter(): ", t1.repeat(repeat=3, number=10000))
    print("count():   ", t2.repeat(repeat=3, number=10000))
    print("seen{}:    ", t3.repeat(repeat=3, number=10000))

Output:

Counter():  [0.48620019999999997, 0.49807440000000014, 0.3896322000000001]
count():    [9.7432961, 9.620701299999999, 9.674791500000001]
seen{}:     [1.4734368999999994, 1.462895500000002, 1.4655799000000016]

It appears that Counter is, as expected, the fastest.

While you are at correcting the code: ``if seen.get(i):`` should likely just be ``if i in seen:`` — MisterMiyagi, Jan 10 '21 at 18:51
`in` is my own preference, but not OP's. The only non-functional correction I made was to replace `l` with `char`, since `l` is perhaps the worst variable name in existence: meaningless *and* hard to read. — Prune, Jan 10 '21 at 18:53

Why is the time taken by collections.Counter to construct a frequency map less than a simple dict creation by a for loop?

2 Answers2