If you need a fixed number of least-common words, e.g., the 10 least common, you probably want a solution using a counter dict and a heapq, as suggested by sotapme's answer (with WoLpH's suggestion) or WoLpH's answer:
import collections, heapq, operator

wordcounter = collections.Counter(article_one)
leastcommon = heapq.nsmallest(10, wordcounter.items(), key=operator.itemgetter(1))
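As a quick sanity check (the article_one here is just a made-up toy list, not the asker's data), nsmallest with an itemgetter(1) key returns the (word, count) pairs with the smallest counts:
article_one = "the the the cat cat sat on on a mat".split()
wordcounter = collections.Counter(article_one)
print(heapq.nsmallest(3, wordcounter.items(), key=operator.itemgetter(1)))
# [('sat', 1), ('a', 1), ('mat', 1)]  -- ties broken by the Counter's iteration order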
However, if you need an unbounded number of them, e.g., all words with fewer than 5 appearances, which could be 6 in one run and 69105 in the next, you might be better off just sorting the list:
import collections, operator, itertools

wordcounter = collections.Counter(article_one)
allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
leastcommon = itertools.takewhile(lambda x: x[1] < 5, allwords)
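One caveat worth noting: takewhile hands back a lazy iterator over the already-sorted list, so if you need to walk the results more than once (or just want to count them), materialize it first:
leastcommon = list(leastcommon)   # consume the takewhile iterator into a real list
print(len(leastcommon), "words appear fewer than 5 times")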
Sorting takes longer than heapifying, but extracting the first M elements is a lot faster from a sorted list than from a heap. Algorithmically, the difference is just some log N factors, so the constants are going to be important here. So the best thing to do is test.
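The actual script is the pastebin code mentioned below; as a rough stand-in, a harness along these lines is enough to reproduce the comparison (the function names, the whitespace-split tokenization, and the number=5 repeat count are my choices here, not necessarily what that script does):
import collections
import heapq
import itertools
import operator
import sys
import timeit

def least_common_heap(words, cutoff=5):
    # heapify (count, word) pairs and pop until the counts reach the cutoff
    counts = collections.Counter(words)
    heap = [(count, word) for word, count in counts.items()]
    heapq.heapify(heap)
    result = []
    while heap and heap[0][0] < cutoff:
        count, word = heapq.heappop(heap)
        result.append((word, count))
    return result

def least_common_sort(words, cutoff=5):
    # sort every (word, count) pair, then take while the count is below the cutoff
    counts = collections.Counter(words)
    allwords = sorted(counts.items(), key=operator.itemgetter(1))
    return list(itertools.takewhile(lambda kv: kv[1] < cutoff, allwords))

if __name__ == '__main__':
    words = open(sys.argv[1]).read().split()
    for name, func in (('heap', least_common_heap), ('sort', least_common_sort)):
        print(name + ':', timeit.timeit(lambda: func(words), number=5))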
Taking my code at pastebin, and a file made by just doing cat reut2* >reut2.sgm on the Reuters-21578 corpus (without processing it to extract the text, so this is obviously not very good for serious work, but should be fine for benchmarking, because none of the SGML tags are going to be among the least common…):
$ python leastwords.py reut2.sgm # Apple 2.7.2 64-bit
heap: 32.5963380337
sort: 22.9287009239
$ python3 leastwords.py reut2.sgm # python.org 3.3.0 64-bit
heap: 32.47026552911848
sort: 25.855643508024514
$ pypy leastwords.py reut2.sgm # 1.9.0/2.7.2 64-bit
heap: 23.95291996
sort: 16.1843900681
I tried various ways to speed up each of them (including: takewhile around a genexp instead of a loop around yield in the heap version, popping optimistic batches with nsmallest and throwing away any excess, making a list and sorting in place, decorate-sort-undecorate instead of a key, partial instead of lambda, etc.), but none of them made more than a 5% improvement (and some made things significantly slower).
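To make one of those concrete (again, a reconstruction rather than the exact pastebin code), the takewhile-around-a-genexp variant of the heap extraction looked roughly like this, reusing the same words list as above:
import collections, heapq, itertools

counts = collections.Counter(words)
heap = [(count, word) for word, count in counts.items()]
heapq.heapify(heap)
# a generator expression that pops the heap, instead of an explicit while-loop with yield
popper = (heapq.heappop(heap) for _ in range(len(heap)))
leastcommon = list(itertools.takewhile(lambda cw: cw[0] < 5, popper))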
At any rate, these are closer than I expected, so I'd probably go with whichever one is simpler and more readable. And on that count, I think the sort version beats the heap version as well, so…
Once again: If you just need the N least common, for reasonable N, I'm willing to bet without even testing that the heap implementation will win.