In the other question you referenced, J.F. Sebastian theorized that using `sort` would be faster than using `Counter` in nearly all cases, but he wisely noted that if you really care you should measure: "In practice it might always be faster than `collections.Counter()` (despite asymptotically O(n) time being better than O(n*log(n)) for `.sort()`). Measure it, if it is important."
Such a measurement is shown below. The high points:

- Up to list sizes of roughly 10^6, `sort` was faster.
- That advantage steadily dwindles as the list size grows.
- These experiments use lists with many distinct values. The benchmarks might differ if the lists had a moderately high percentage of repeated values.
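If you want to test the repeated-values scenario yourself, one way (not measured here; the pool size of 100 is just an illustrative choice) is to draw the list elements from a small pool so that most values repeat:

```python
import random

def dup_heavy(list_size, n_distinct=100):
    # Sample from a small pool of values, so nearly every value repeats.
    # n_distinct controls how few distinct values appear.
    return [random.randrange(n_distinct) for _ in range(list_size)]

xs = dup_heavy(10000)
ys = xs[:]          # same multiset of values...
random.shuffle(ys)  # ...in a different order
```

Lists built this way can be dropped into the benchmark below in place of `list(range(list_size))`.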
These measurements suggest that the overhead of creating a `Counter` is generally larger than the cost of just sorting the data in place. One way to think about this is to weigh the overhead of `Counter` creation against the log(n) portion of the `sort` approach. Not until n gets fairly large does the algorithmic advantage of the `Counter` data structure pay off enough to justify that overhead.

But remember: for small data sizes, the speed advantage might be irrelevant -- unless you are repeating the operation at high volume. In that light, the "algorithmically correct" approach (use a `Counter`) might still be better from a code-design point of view for most use cases -- because it's faster when the data is large and fast enough when the data is small.
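For reference, both approaches are just multiset-equality checks. A non-destructive sketch (using `sorted()` copies rather than the in-place `.sort()` used in the benchmark, so the inputs are left untouched) might look like:

```python
from collections import Counter

def same_elements_counter(xs, ys):
    # O(n): build a frequency table for each list and compare them.
    return Counter(xs) == Counter(ys)

def same_elements_sort(xs, ys):
    # O(n log n), but with a small constant factor; sorted() returns
    # new lists, so xs and ys are not modified.
    return sorted(xs) == sorted(ys)
```

Note one practical difference: the `Counter` version only needs hashable elements, while the `sorted` version needs mutually comparable elements.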
The code:

```python
import timeit, random
from collections import Counter

def prep():
    # Reshuffle so each timed run starts from unsorted data.
    random.shuffle(xs)
    random.shuffle(ys)

def use_counter():
    prep()
    return Counter(xs) == Counter(ys)

def use_sort():
    prep()
    xs.sort()
    ys.sort()
    return xs == ys

# (timeit repeat count, list size, timeit number)
experiments = [
    (3, 10, 100000),
    (3, 100, 10000),
    (3, 1000, 1000),
    (3, 10000, 100),
    (3, 100000, 10),
    (3, 1000000, 1),
    (1, 10000000, 1),
]

for e in experiments:
    repeat, list_size, timeit_n = e
    xs = list(range(list_size))
    ys = list(range(list_size))
    r1 = timeit.repeat(use_counter, repeat=repeat, number=timeit_n)
    r2 = timeit.repeat(use_sort, repeat=repeat, number=timeit_n)
    print()
    print(e)
    print('total ', sum(r1), sum(r2))
    print('m1/m2 ', min(r1) / min(r2))
```
Example output (ratio > 1 means `Counter` is slower):

```
(3, 10, 100000)
total  5.06751918793 2.15432405472
m1/m2  2.34850470872

(3, 100, 10000)
total  3.16299915314 2.06651735306
m1/m2  1.52879303981

(3, 1000, 1000)
total  3.017786026 2.42989587784
m1/m2  1.24086325316

(3, 10000, 100)
total  3.06426525116 2.74061489105
m1/m2  1.11802891855

(3, 100000, 10)
total  3.66198205948 3.35467290878
m1/m2  1.1028159291

(3, 1000000, 1)
total  5.19361901283 5.08777713776
m1/m2  1.03125948765

(1, 10000000, 1)
total  20.0118789673 24.6061840057
m1/m2  0.813286569044
```