Let r be a string; we want to count the number of occurrences of each character in r. If we reason quickly, Counter(r) should be about 100 times faster than {c: r.count(c) for c in set(r)}. Indeed, normal text contains roughly 100 distinct characters (uppercase/lowercase letters, punctuation, digits...), so .count will scan the whole string r about 100 times, whereas Counter makes only a single pass.
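For reference, here is a minimal sketch comparing the two approaches on a small sample string (the sample text below is an assumption for illustration, not the actual book text):

```python
# Compare Counter against a .count-based dict comprehension.
# The sample string here is a placeholder, not the real test data.
import timeit
from collections import Counter

sample = "the quick brown fox jumps over the lazy dog" * 1000

counts_counter = Counter(sample)
counts_comprehension = {c: sample.count(c) for c in set(sample)}

# Both approaches produce identical character counts.
assert counts_counter == counts_comprehension

# Rough timing of each approach (absolute numbers vary by machine).
t_counter = timeit.timeit(lambda: Counter(sample), number=10)
t_count = timeit.timeit(
    lambda: {c: sample.count(c) for c in set(sample)}, number=10
)
print(f"Counter: {t_counter:.4f}s  .count comprehension: {t_count:.4f}s")
```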
However, the timings do not agree with the above reasoning (here r is the content of all "The Lord of the Rings" books):
In [71]: %timeit d = collections.Counter(r)
10 loops, best of 3: 98.8 ms per loop
In [72]: %timeit d = {c:r.count(c) for c in set(r)}
10 loops, best of 3: 114 ms per loop
In [73]: len(r)
Out[73]: 972550
Even if we increase the size of the string, the ratio stays the same:
In [74]: r = r*100
In [79]: %time d = collections.Counter(r)
CPU times: user 9.9 s, sys: 12 ms, total: 9.91 s
Wall time: 9.93 s
In [81]: %time d = {c:r.count(c) for c in set(r)}
CPU times: user 11.5 s, sys: 0 ns, total: 11.5 s
Wall time: 11.6 s
I've read the source of .count, and it does not do any caching or precomputation (unless I looked at the wrong implementation function): https://hg.python.org/cpython-fullhistory/file/tip/Objects/stringlib/fastsearch.h . How do you explain this?
EDIT: my config: Python 3.4.3 (default, Mar 26 2015, 22:07:01) on Lubuntu 15.04.