
I want to know the best way to count words in a document. I have my own "corp.txt" corpus set up, and I want to know how frequently the words "students", "trust", and "ayre" occur in the file "corp.txt". What could I use?

Would it be one of the following:

full = nltk.Text(mycorpus.words('FullReport.txt'))
fdist = FreqDist(full)
fdist
# <FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students", "trust", "ayre" occur in full?

Thanks, Ray

octopusgrabbus
Ray Hmar
    Neither one of those are provided by the standard python library. Are you sure you're not thinking of NLTK? – Chris Eberle Nov 15 '11 at 16:00
  • Looking at your name, i'm gonna pretend that you know what "students trust ayre" means. Anyway, i would go with `FreqDist`. `fdist = FreqDist(); for word in tokenize.whitespace(sent): fdist.inc(word.lower())`. You can check the doc [here](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html). – aayoubi Nov 15 '11 at 16:02
  • I edited the answer please double check it for me. Thank you – Ray Hmar Nov 15 '11 at 16:13
  • possible duplicate of [How optimize word counting in Python?](http://stackoverflow.com/questions/22849919/how-optimize-word-counting-in-python) – alvas Apr 07 '14 at 04:59

4 Answers


I would suggest looking into collections.Counter. Especially for large amounts of text, this does the trick and is limited only by the available memory. On a computer with 12 GB of RAM it counted 30 billion tokens in a day and a half. Pseudocode (the variable words will in practice be a reference to a file or similar):

from collections import Counter

my_counter = Counter()
for word in words:           # words: an iterable of tokens
    my_counter[word] += 1    # note: update(word) would count the characters of a string

When the loop finishes, the counts are in the dictionary-like my_counter, which can then be written to disk or stored elsewhere (SQLite, for example).
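For illustration, here is a tiny runnable sketch of this approach; the token list is made up for the example:

```python
from collections import Counter

# Hypothetical list of tokens; in practice these would come from the corpus.
words = ["students", "trust", "ayre", "students", "trust", "students"]
my_counter = Counter(words)          # Counter accepts any iterable of tokens

print(my_counter["students"])        # 3
print(my_counter.most_common(1))     # [('students', 3)]
```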

Lars GJ
  • Note that Counter() can take a list as input, so if W is a list of words: `counts = Counter(W)` will do the trick – Lars GJ Jul 15 '22 at 08:54

You are almost there! You can index the FreqDist using the word you are interested in. Try the following:

print(fdist['students'])
print(fdist['ayre'])
print(fdist['full'])

This gives you the count, i.e. the number of occurrences, of each word. You said "how frequently" - frequency is different from the raw count - and you can get it like this:

print(fdist.freq('students'))
print(fdist.freq('ayre'))
print(fdist.freq('full'))
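Under the hood, freq(w) is just the count of w divided by the total number of samples. A plain-Python sketch of that relationship, using a made-up token list (no NLTK needed):

```python
from collections import Counter

tokens = ["students", "trust", "students", "ayre"]  # hypothetical tokens
counts = Counter(tokens)
total = sum(counts.values())

# Equivalents of fdist['students'] and fdist.freq('students')
print(counts["students"])          # 2
print(counts["students"] / total)  # 0.5
```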
Spaceghost

Most people would just use a defaultdict (with a default value of 0). Every time you see a word, increment its value by one:

from collections import defaultdict

total = 0
count = defaultdict(int)     # each new word starts at 0
for word in words:           # words: an iterable of tokens
    total += 1
    count[word] += 1

# Now you can just determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))
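With a small hypothetical word list, the loop above produces output like this (sorted here so the order is stable):

```python
from collections import defaultdict

words = ["trust", "students", "trust"]  # hypothetical sample tokens
total = 0
count = defaultdict(int)
for word in words:
    total += 1
    count[word] += 1

for word, ct in sorted(count.items()):
    print('Frequency of %s: %f%%' % (word, 100.0 * ct / total))
# Frequency of students: 33.333333%
# Frequency of trust: 66.666667%
```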
Chris Eberle

You can read a file and then tokenize and put the individual tokens into a FreqDist object in NLTK, see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1  # fdist.inc(word) in older NLTK versions

print "'blah' occurred", fdist['blah'], "times"

[out]:

'blah' occurred 3 times

Alternatively, you can use the native Counter object from collections and you get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case sensitive, so you might also want to lowercase your text before tokenizing:

from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print "'blah' occurred", fdist['blah'], "times"
alvas