5

I want to generate an ordered list of the least common words within a large body of text, with the least common word appearing first along with a value indicating how many times it appears in the text.

I scraped the text from some online journal articles, then simply assigned and split:

article_one = """ large body of text """.split() 
=> ("large","body", "of", "text")

Seems like a regex would be appropriate for the next steps, but being new to programming I'm not well versed. If the best answer includes a regex, could someone point me to a good regex tutorial other than pydoc?

Benjamin James
  • Are you looking for something like "the 10 least common words", or for something like "all words with fewer than 5 appearances" (which could be 3 in one run, 69105 in another)? I ask because `heapq.nsmallest` is probably your best bet for the former, but a heap may not be as good for the latter. – abarnert Jan 31 '13 at 02:19

5 Answers

4

How about a shorter/simpler version with a `defaultdict`? `Counter` is nice, but it needs Python 2.7; this works from 2.5 and up :)

import collections

counter = collections.defaultdict(int)
article_one = """ large body of text """

for word in article_one.split():
    counter[word] += 1

# sort by (count, word) so the least common words come first
print sorted(counter.iteritems(), key=lambda x: x[::-1])
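
If you only need the N least common words, as mgilson suggests in the comments, a minimal sketch that swaps the final sorted() for heapq.nsmallest (same counter as above; heapq.nsmallest with a key also works from Python 2.5):

import heapq

# 10 least common (word, count) pairs without sorting the whole dict
print heapq.nsmallest(10, counter.iteritems(), key=lambda x: x[::-1])
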
Wolph
  • 2
    If you truly need the 10 least common and don't care about the order of the others, you could probably work with `heapq` to get the minimum elements in O(N) rather than O(NlogN). – mgilson Jan 31 '13 at 01:43
  • @mgilson: wouldn't you still need to store all of them? Otherwise you don't know which are and which aren't in there. – Wolph Jan 31 '13 at 01:45
  • Yeah, you still need the Counter or defaultdict as you have here. I'm just proposing an alternative to the last line (`sorted`) – mgilson Jan 31 '13 at 01:46
  • Ah... now I understand what you meant, wouldn't adding them to the heapq still be `n` times a `log n` operation? Depends on the variant of course: http://en.wikipedia.org/wiki/Heap_(data_structure)#Comparison_of_theoretic_bounds_for_variants – Wolph Jan 31 '13 at 01:49
  • Shouldn't sort be ``print sorted(counter.iteritems(), key=itemgetter(1))`` to sort by ``count`` and not ``word``? – sotapme Jan 31 '13 at 01:50
  • @sotapme: this sorts on count first and word second; yours sorts on count first and has an arbitrary order for the word :) – Wolph Jan 31 '13 at 01:51
  • @WoLpH -- the docs on `heapify` state that it works in linear time. and the docs imply that `heapq.nsmallest` is better than `sorted` for small `n`. Also `heapq` is what the collections module uses for `Counter.most_common`. – mgilson Jan 31 '13 at 01:53
  • Well I've just run the above with txt as ``a large body of text large enough of epic body`` and get ``[('a', 1), ('body', 2), ('enough', 1), ('epic', 1), ('large', 2), ('of', 2), ('text', 1)]`` – sotapme Jan 31 '13 at 02:12
  • SO needs a http://jsfiddle.net/ type thing where code can be run and results shown. Even copy/pasting code from SO can be a real faff. – sotapme Jan 31 '13 at 10:05
3

Finding the least common elements in a list. According to the `Counter` class in the `collections` module:

c.most_common()[:-n-1:-1]       # n least common elements

So the code for the single least common element in a list is:

from collections import Counter
Counter( mylist ).most_common()[:-2:-1]

And for the two least common elements:

from collections import Counter
Counter( mylist ).most_common()[:-3:-1]
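
For example, a quick sketch with a hypothetical mylist of words (the order among equal counts is not guaranteed):

from collections import Counter

mylist = "a large body of text large enough of epic body".split()
print(Counter(mylist).most_common()[:-3:-1])
# the two entries with the lowest counts
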

dwalsh84
1

This uses a slightly different approach but it appears to suit your needs. Uses code from this answer.

#!/usr/bin/env python
import operator
import string

article_one = """A, a b, a b c, a b c d, a b c d efg.""".split()
wordbank = {}

for word in article_one:
    # Strip word of punctuation and capitalization
    word = word.lower().strip(string.punctuation)
    if word not in wordbank:
        # Create a new dict key if necessary
        wordbank[word] = 1
    else:
        # Otherwise, increment the existing key's value
        wordbank[word] += 1

# Sort dict by value
sortedwords = sorted(wordbank.iteritems(), key=operator.itemgetter(1))

for word in sortedwords:
    print word[1], word[0]

Outputs:

1 efg
2 d
3 c
4 b
5 a

Works in Python >= 2.4, and Python 3+ if you parenthesize the print statement at the bottom and change iteritems to items.
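
For reference, a minimal sketch of those two Python 3 changes (everything above the sort stays the same):

# Python 3: items() instead of iteritems(), and print is a function
sortedwords = sorted(wordbank.items(), key=operator.itemgetter(1))

for word in sortedwords:
    print(word[1], word[0])
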

ashastral
0

Ready-made answer from the mothership.

# From the official documentation ->>
>>> from collections import Counter
>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
## ^^^^--- from the standard documentation.

>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

>>> import heapq
>>> from operator import itemgetter
>>> def least_common(adict, n=None):
...     if n is None:
...         return sorted(adict.iteritems(), key=itemgetter(1), reverse=False)
...     return heapq.nsmallest(n, adict.iteritems(), key=itemgetter(1))
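
A quick usage sketch (borrowing WoLpH's Counter suggestion from the comments below; hypothetical article text):

>>> from collections import Counter
>>> cnt = Counter(article_one.split())
>>> least_common(cnt, 10)   # the ten least common (word, count) pairs
>>> least_common(cnt)       # every word, least common first
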

Obviously adapt to suit :D

sotapme
  • This requires Python 2.7, which is fine; see this post if you have to use <2.7: http://stackoverflow.com/questions/2161752/how-to-count-the-frequency-of-the-elements-in-a-list – Brian Larsen Jan 31 '13 at 01:37
  • Well I did say __adapt to suit__ :( - Although they seem to have forgotten about ``least_common()`` :(( – sotapme Jan 31 '13 at 01:45
  • If you look at the [source](http://hg.python.org/cpython/file/2.7/Lib/collections.py) for `most_common`, it's pretty easy to see how to adapt it for a `least_common`. (I believe they use `heapq`) – mgilson Jan 31 '13 at 01:47
  • @mgilson good point - just looked - love the comment in the code `` # Emulate Bag.sortedByCount from Smalltalk`` – sotapme Jan 31 '13 at 01:54
  • 3
    @sotapme shorter version: `collections.Counter(article_one.split())` – Wolph Jan 31 '13 at 01:54
  • I'll pull from the dict in the first example to get the least common. Thanks, people. – Benjamin James Jan 31 '13 at 02:41
  • @mgilson just came across this; c.most_common()[:-n-1:-1] # n least common elements – Benjamin James Jan 31 '13 at 05:02
  • @mgilson I just pulled it from pydoc, [here](http://docs.python.org/2/library/collections.html?highlight=coun#collections.Counter.most_common), though I'm not sure how to implement it – Benjamin James Jan 31 '13 at 17:49
  • @BenjaminJames -- Oh, I see -- That's the same thing as sorting the entire `dict.items()`, and then taking the last few, so you're correct, it will work, but it's O(nlogn) – mgilson Jan 31 '13 at 17:55
  • Doesn't actually answer the question. Doesn't give a clear path to answer the question, and also has a very non-idiomatic use of `Counter` as a `defaultdict(int)` which shouldn't be encouraged. – Slater Victoroff Jan 09 '18 at 18:05
0

If you need a fixed number of least-common words, e.g., the 10 least common, you probably want a solution using a counter dict and a heapq, as suggested by sotapme's answer (with WoLpH's suggestion) or WoLpH's answer:

import collections, heapq, operator

wordcounter = collections.Counter(article_one)
# the 10 least common (word, count) pairs, pulled off a heap
leastcommon = heapq.nsmallest(10, wordcounter.items(), key=operator.itemgetter(1))

However, if you need an unbounded number of them, e.g., all words with fewer than 5 appearances, which could be 6 in one run and 69105 in the next, you might be better off just sorting the list:

import collections, itertools, operator

wordcounter = collections.Counter(article_one)
allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
# takewhile returns an iterator; wrap it in list() if you need to reuse the result
leastcommon = itertools.takewhile(lambda x: x[1] < 5, allwords)

Sorting takes longer than heapifying, but extracting the first M elements is a lot faster with a list than a heap. Algorithmically, the difference is just some log N factors, so the constants are going to be important here. So the best thing to do is test.
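
A minimal sketch of such a test (not the actual benchmark script mentioned below; it assumes wordcounter has already been built as above, and Python 3):

import heapq, itertools, operator, timeit

def by_sort(wordcounter, cutoff=5):
    # sort everything, then take items until the cutoff
    allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
    return list(itertools.takewhile(lambda x: x[1] < cutoff, allwords))

def by_heap(wordcounter, cutoff=5):
    # heapify once (O(N)), then pop (O(log N) each) until the cutoff
    heap = [(count, word) for word, count in wordcounter.items()]
    heapq.heapify(heap)
    out = []
    while heap and heap[0][0] < cutoff:
        count, word = heapq.heappop(heap)
        out.append((word, count))
    return out

for fn in (by_heap, by_sort):
    print(fn.__name__, timeit.timeit(lambda: fn(wordcounter), number=10))
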

Taking my code at pastebin, and a file made by just doing cat reut2* >reut2.sgm on the Reuters-21578 corpus (without processing it to extract the text, so this is obviously not very good for serious work, but should be fine for benchmarking, because none of the SGML tags are going to be in the least common…):

$ python leastwords.py reut2.sgm # Apple 2.7.2 64-bit
heap: 32.5963380337
sort: 22.9287009239
$ python3 leastwords.py reut2.sgm # python.org 3.3.0 64-bit
heap: 32.47026552911848
sort: 25.855643508024514
$ pypy leastwords.py reut2.sgm # 1.9.0/2.7.2 64-bit
heap: 23.95291996
sort: 16.1843900681

I tried various ways to speed up each of them (including: takewhile around a genexp instead of a loop around yield in the heap version, popping optimistic batches with nsmallest and throwing away any excess, making a list and sorting in place, decorate-sort-undecorate instead of a key, partial instead of lambda, etc.), but none of them made more than 5% improvement (and some made things significantly slower).

At any rate, these are closer than I expected, so I'd probably go with whichever one is simpler and more readable. But I think sort beats heap there, as well, so…

Once again: If you just need the N least common, for reasonable N, I'm willing to bet without even testing that the heap implementation will win.

abarnert