Most efficient way to remove unique keys from python's Counter() multiset dictionary

Question

I find the Counter() function in python very useful, and it has a built-in Counter.most_common() function to return the keys in order of most to least common. But I haven't found a build_in function to filter out all keys that are unique, or over/under a threshold of appearances. This isn't too difficult to do with a dictionary comphrension, but seems like such a common function, I was wondering (a) why it's not built-in, and (b) did I miss it?

My method, given some data:

x = ['904692239', '904692239', '416618990', '965059531', '904692239', '48644135', '904692239', '386210409', '978527886', '102666625', '793573909', '826761606', '980035897', '837619792', '709804744', '703248895', '904692239', '843796438', '633621488', '374214079', '904692239', '218851385', '359342704', '793949601', '793949601', '216731638', '793949601', '721831814', '715338006', '466865724', '160977579', '971821714', '715338006', '612216206', '658467007', '67897613', '255688245', '457452213', '457452213', '984706137', '160977579', '160977579', '503944932', '261444687', '95794369', '286082560', '974609408', '457408015', '376428770', '636170296', '636170296', '636170296', '721831814']
from collections import Counter
y = Counter(x)
non_uniques = [k for k,v in Counter(x).items() if v > 1]
>>>non_uniques
['715338006', '457452213', '904692239', '160977579', '793949601', '636170296', '721831814']

Is this the most efficient way to do this? (take a list and filter out all non-unique occurrences in that list)

Why would someone supply a builtin function for something the language can express in less than 50 bytes of source code... — Niklas B., Feb 02 '14 at 02:19
@NiklasB. -- `sorted(counter.items(), key=itemgetter(1), reversed=True)[:n]` is only 60 bytes of code and effectively replaces `Counter.most_common`, but we have a method for that because it's super useful (and because we can do a little better with `heapq`)... — mgilson, Feb 02 '14 at 02:31
You might also be interested in this answer: http://stackoverflow.com/a/480227/748858 — mgilson, Feb 02 '14 at 02:32
@mgilson: I think the actual reason is the one you mentioned in your last sentence, that it's actually not that trivial (and if you ask me, that one-liner you wrote there is considerably more complicated than the one OP wrote, which is basically a standard list comprehension) — Niklas B., Feb 02 '14 at 02:38
@NiklasB. -- The point that I was trying to make is that `Counter.most_common` itself could be robustly implemented as a 1-liner (no comparison with OP's code). The point is that if something is really useful, it usually ends up in the stdlib even if it only takes a couple lines of code. However, you need to demonstrate that it's extremely useful enough to warrant a new method and that's the hard part. In this case, I would consider the usage of a Counter to figure out which elements in a list are unique (or not) to be a misuse of Counter... — mgilson, Feb 02 '14 at 03:05
@NiklasB. - My question was prompted by thinking (a) I do this a LOT and (b) there might be a more efficient way to do it, since I run this inside loops usually. And if using Counter() to filter out uniques is a misuse, what is the optimal way to filter them out? — Marc Maxmeister, Feb 03 '14 at 16:15
@Marc: I didn't say it was a misuse, in fact it's a good way to do it your way as far as asymptotic complexity goes. You might get a slightly lower constant factor by using a `set` in the way suggested by the answer mgilson linked to, but you'd have to verify using a simple benchmark — Niklas B., Feb 03 '14 at 19:08

Martijn Pieters · Accepted Answer · 2014-02-02T02:14:37.380

1

No, there is no built-in method, this is not usually a requirement.

You can use itertools.takewhile() to get all 'unique' keys:

from itertools import takewhile

for unique in takewhile(lambda pair: p[1] == 1, reversed(y.most_common())):

or for all non-unique:

for unique in takewhile(lambda pair: p[1] > 1, y.most_common()):

Both methods do require a full sort of the dictionary, so your own method, looping just once, is fine too.

edited Feb 02 '14 at 02:14

answered Feb 02 '14 at 01:59

Martijn Pieters

1,048,767
296
4,058
3,343

using `most_common` you create an intermediary list, unlike the original code (if you run on Python 3 or using `iteritems`) – JBernardo Feb 02 '14 at 02:03
Yup, and `.most_common()` has to sort. – Martijn Pieters Feb 02 '14 at 02:09

Most efficient way to remove unique keys from python's Counter() multiset dictionary

1 Answers1