1

I know that I can use collections.Counter to count elements from a list, and its most_common method to see which element was most common.

However, when there are multiple elements that have the same frequency, it appears that this will break ties in favor of whichever element appeared first:

>>> from collections import Counter
>>> coun = Counter([1, 3, 2, 2, 3])
>>> coun.most_common(1)
[(3, 2)]
>>> coun = Counter([1, 2, 3, 2, 3])
>>> coun.most_common(1)
[(2, 2)]

How can I make it so that the 2 element (and its count, here also 2) is reported, regardless of the order of the input list? In general, I want the smallest of the elements that is tied for most common in the list.

I suppose I could just sort the input list, but is there a faster way?

Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153
Uday P
  • 71
  • 1
  • 5
  • Sorting the list is not a problem unless it's REALLY big (100k+ items). And even then I would test if it would be slow – Exelian Aug 14 '20 at 07:10
  • @Exelian Yes. I believe the input lists are quite large according to the test cases, So i didn't want to take chances and searching if there's any simpler method – Uday P Aug 14 '20 at 07:13

2 Answers2

5

Depending on the amount of duplicates you are expecting you could simply check more of the most_common values? Assuming that there's no more than 100 values with exactly the same amount you could simply do:

print(sorted(coun.most_common(100))[0])

You could use a different values for 100 of course. But now the list to sort would be at most 100 tuples, which of course isn't a problem.

Exelian
  • 5,749
  • 1
  • 30
  • 49
  • Just as a note, this will not sort correctly if the elements to be sorted are strings representing numbers: `sorted(Counter(['12', '1', '2', '9']).most_common(3))` will return `[('1', 1), ('12', 1), ('2', 1)]` – ChatterOne Aug 14 '20 at 07:27
  • Ah, yes, definitely. The question had numbers in them so I assumed that wouldn't be an issue, but you can also add a `key` function to the `sorted` call to resolve any sorting problems – Exelian Aug 14 '20 at 07:37
  • @ChatterOne if the elements are strings, then they *should be expected* to sort in that "wrong" order by default - whoever set the problem should want `'12'` to be a "smaller" element than `'2'`, or else would have used integers in the first place. Of course, a custom sort order is easy enough to specify. – Karl Knechtel Mar 08 '23 at 17:05
0

From the source code of the reference implementation, in the case where an integer is passed to Counter.most_common, the result is calculated like so:

return heapq.nlargest(n, self.items(), key=_itemgetter(1))

using the standard library heapq, and where _itemgetter is

from operator import itemgetter as _itemgetter

The .items of a Counter are, of course, the key-value pairs as 2-tuples, stored in a dict_items (since Counter subclasses the built-in dict). The key function passed to heapq.nlargest tells the algorithm how to order the elements: according to the value (i.e., element count). (This algorithm is used because it's faster than sorting all the items.)

So, we can simply emulate this logic, passing our own key. The key should sort Counter items by value (count) "forwards", then by key (element) "backwards".

Since the elements in the original list are numeric, we can easily represent that:

import heapq
from collections import Counter

def smallest_most_common(seq):
    return heapq.nlargest(1, Counter(seq).items(), key=lambda i:(i[1], -i[0]))

Testing it:

>>> smallest_most_common([1, 3, 2, 2, 3])
[(2, 2)]
>>> smallest_most_common([1, 2, 3, 2, 3])
[(2, 2)]

However, this breaks for non-numeric keys, because they can't be negated:

>>> # two t's, two c's; the t shows up first but c is "smaller"
>>> smallest_most_common('witchcraft')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in smallest_most_common
  File "/usr/lib/python3.8/heapq.py", line 531, in nlargest
    result = max(it, default=sentinel, key=key)
  File "<stdin>", line 2, in <lambda>
TypeError: bad operand type for unary -: 'str'

However, the element counts will always be numeric. So, a simple trick is to switch to using heapq.nsmallest, and negate the count rather than the elements:

import heapq
from collections import Counter

def smallest_most_common(seq):
    return heapq.nsmallest(1, Counter(seq).items(), key=lambda i:(-i[1], i[0]))

(This is a common trick used for sorting.)

Now everything works:

>>> smallest_most_common([1, 3, 2, 2, 3])
[(2, 2)]
>>> smallest_most_common([1, 2, 3, 2, 3])
[(2, 2)]
>>> smallest_most_common('witchcraft')
[('c', 2)]
>>> smallest_most_common('craftwitch')
[('c', 2)]
Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153