
I want to know the best way to count words in a document. I have my own "corp.txt" corpus set up, and I want to know how frequently the words "students", "trust", and "ayre" occur in the file "corp.txt". What could I use?

Would it be one of the following:

full = nltk.Text(mycorpus.words('FullReport.txt'))
fdist = FreqDist(full)
fdist
# <FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students", "trust", "ayre" occur in full?

Thanks, Ray

octopusgrabbus
Ray Hmar
    Neither one of those are provided by the standard python library. Are you sure you're not thinking of NLTK? – Chris Eberle Nov 15 '11 at 16:00
  • Looking at your name, i'm gonna pretend that you know what "students trust ayre" means. Anyway, i would go with `FreqDist`. `fdist = FreqDist(); for word in tokenize.whitespace(sent): fdist.inc(word.lower())`. You can check the doc [here](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html). – aayoubi Nov 15 '11 at 16:02
  • I edited the answer please double check it for me. Thank you – Ray Hmar Nov 15 '11 at 16:13
  • possible duplicate of [How optimize word counting in Python?](http://stackoverflow.com/questions/22849919/how-optimize-word-counting-in-python) – alvas Apr 07 '14 at 04:59

4 Answers


I would suggest looking into collections.Counter. Especially for large amounts of text, this does the trick and is limited only by the available memory. On a computer with 12 GB of RAM it counted 30 billion tokens in a day and a half. Pseudocode (the variable words will in practice be a reference to a file or similar):

from collections import Counter

my_counter = Counter()
for word in words:           # words: an iterable of tokens
    my_counter[word] += 1    # note: update(word) would count the characters of a string

When the loop finishes, the counts are in the dictionary-like my_counter, which can then be written to disk or stored elsewhere (SQLite, for example).
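For illustration, here is a tiny runnable sketch of this approach; the token list is made up for the example:

```python
from collections import Counter

# Hypothetical list of tokens; in practice these would come from the corpus.
words = ["students", "trust", "ayre", "students", "trust", "students"]
my_counter = Counter(words)          # Counter accepts any iterable of tokens

print(my_counter["students"])        # 3
print(my_counter.most_common(1))     # [('students', 3)]
```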

Lars GJ
  • Note that Counter() can take a list as input, so if W is a list of words: `counts = Counter(W)` will do the trick – Lars GJ Jul 15 '22 at 08:54

You are almost there! You can index the FreqDist using the word you are interested in. Try the following:

print(fdist['students'])
print(fdist['ayre'])
print(fdist['full'])

This gives you the count, i.e. the number of occurrences, of each word. You said "how frequently" - frequency is different from the raw count - and you can get it like this:

print(fdist.freq('students'))
print(fdist.freq('ayre'))
print(fdist.freq('full'))
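Under the hood, freq(w) is just the count of w divided by the total number of samples. A plain-Python sketch of that relationship, using a made-up token list (no NLTK needed):

```python
from collections import Counter

tokens = ["students", "trust", "students", "ayre"]  # hypothetical tokens
counts = Counter(tokens)
total = sum(counts.values())

# Equivalents of fdist['students'] and fdist.freq('students')
print(counts["students"])          # 2
print(counts["students"] / total)  # 0.5
```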
Spaceghost

Most people would just use a defaultdict (with a default value of 0). Every time you see a word, increment its value by one:

from collections import defaultdict

total = 0
count = defaultdict(int)     # each new word starts at 0
for word in words:           # words: an iterable of tokens
    total += 1
    count[word] += 1

# Now you can just determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))
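With a small hypothetical word list, the loop above produces output like this (sorted here so the order is stable):

```python
from collections import defaultdict

words = ["trust", "students", "trust"]  # hypothetical sample tokens
total = 0
count = defaultdict(int)
for word in words:
    total += 1
    count[word] += 1

for word, ct in sorted(count.items()):
    print('Frequency of %s: %f%%' % (word, 100.0 * ct / total))
# Frequency of students: 33.333333%
# Frequency of trust: 66.666667%
```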
Chris Eberle

You can read a file and then tokenize and put the individual tokens into a FreqDist object in NLTK, see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1  # fdist.inc(word) in older NLTK versions

print "'blah' occurred", fdist['blah'], "times"

[out]:

'blah' occurred 3 times

Alternatively, you can use the native Counter object from collections and you get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case sensitive, so you might also want to lowercase your text before tokenizing:

from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print "'blah' occurred", fdist['blah'], "times"
alvas