In the other question you referenced, J.F. Sebastian theorized that using `sort` would be faster than using `Counter` in nearly all cases, but he wisely noted that if you really care you should measure: "In practice it might always be faster than `collections.Counter()` (despite asymptotically O(n) time being better than O(n*log(n)) for `.sort()`). Measure it, if it is important."
Such a measurement is shown below. The high points:

- Up to list sizes of roughly 10^6, `sort` was faster.
- That advantage steadily dwindles as the list size grows.
- These experiments use lists with many distinct values. The benchmarks might differ if the lists had a moderately high percentage of repeated values.
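If you want to test the repeated-values scenario yourself, one way (not measured here; the pool size of 100 is just an illustrative choice) is to draw the list elements from a small pool so that most values repeat:

```python
import random

def dup_heavy(list_size, n_distinct=100):
    # Sample from a small pool of values, so nearly every value repeats.
    # n_distinct controls how few distinct values appear.
    return [random.randrange(n_distinct) for _ in range(list_size)]

xs = dup_heavy(10000)
ys = xs[:]          # same multiset of values...
random.shuffle(ys)  # ...in a different order
```

Lists built this way can be dropped into the benchmark below in place of `list(range(list_size))`.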
These measurements suggest that the overhead of creating a `Counter` is generally larger than the cost of just sorting the data in place. One way to think about this is to weigh the overhead of `Counter` creation against the log(n) portion of the `sort` approach. Not until n gets fairly large does the algorithmic advantage of the `Counter` data structure pay off enough to justify that overhead.

But remember: for small data sizes, the speed advantage might be irrelevant -- unless you are repeating the operation at high volume. In that light, the "algorithmically correct" approach (use a `Counter`) might still be better from a code-design point of view for most use cases -- because it's faster when the data is large and fast enough when the data is small.
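For reference, both approaches are just multiset-equality checks. A non-destructive sketch (using `sorted()` copies rather than the in-place `.sort()` used in the benchmark, so the inputs are left untouched) might look like:

```python
from collections import Counter

def same_elements_counter(xs, ys):
    # O(n): build a frequency table for each list and compare them.
    return Counter(xs) == Counter(ys)

def same_elements_sort(xs, ys):
    # O(n log n), but with a small constant factor; sorted() returns
    # new lists, so xs and ys are not modified.
    return sorted(xs) == sorted(ys)
```

Note one practical difference: the `Counter` version only needs hashable elements, while the `sorted` version needs mutually comparable elements.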
The code:

```python
import timeit, random
from collections import Counter

def prep():
    # Reshuffle so each timed run starts from unsorted data.
    random.shuffle(xs)
    random.shuffle(ys)

def use_counter():
    prep()
    return Counter(xs) == Counter(ys)

def use_sort():
    prep()
    xs.sort()
    ys.sort()
    return xs == ys

# (timeit repeat count, list size, timeit number)
experiments = [
    (3, 10, 100000),
    (3, 100, 10000),
    (3, 1000, 1000),
    (3, 10000, 100),
    (3, 100000, 10),
    (3, 1000000, 1),
    (1, 10000000, 1),
]

for e in experiments:
    repeat, list_size, timeit_n = e
    xs = list(range(list_size))
    ys = list(range(list_size))
    r1 = timeit.repeat(use_counter, repeat=repeat, number=timeit_n)
    r2 = timeit.repeat(use_sort, repeat=repeat, number=timeit_n)
    print()
    print(e)
    print('total ', sum(r1), sum(r2))
    print('m1/m2 ', min(r1) / min(r2))
```
Example output (ratio > 1 means `Counter` is slower):

```
(3, 10, 100000)
total  5.06751918793 2.15432405472
m1/m2  2.34850470872

(3, 100, 10000)
total  3.16299915314 2.06651735306
m1/m2  1.52879303981

(3, 1000, 1000)
total  3.017786026 2.42989587784
m1/m2  1.24086325316

(3, 10000, 100)
total  3.06426525116 2.74061489105
m1/m2  1.11802891855

(3, 100000, 10)
total  3.66198205948 3.35467290878
m1/m2  1.1028159291

(3, 1000000, 1)
total  5.19361901283 5.08777713776
m1/m2  1.03125948765

(1, 10000000, 1)
total  20.0118789673 24.6061840057
m1/m2  0.813286569044
```