You can read a file, tokenize it, and put the individual tokens into an NLTK FreqDist object; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html
from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads the file and counts each token in a FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1  # FreqDist.inc() was removed in NLTK 3

print("'blah' occurred", fdist['blah'], "times")
[out]:
'blah' occurred 3 times
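Note that the count of 3 depends on the tokenizer: `word_tokenize` splits the trailing punctuation off `"blah!"`, whereas a naive `str.split()` keeps it attached, so that occurrence would land under a different key. A minimal stdlib-only sketch of the difference:

```python
from collections import Counter

doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"

# Naive whitespace tokenization keeps punctuation glued to words,
# so 'blah!' becomes its own key instead of counting toward 'blah'.
naive = Counter(doc.split())
print(naive['blah'])   # 2
print(naive['blah!'])  # 1
```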
Alternatively, you can use the native Counter object from collections and get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case-sensitive, so you might also want to lowercase your tokens:
from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads the file and counts the lowercased tokens in a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print("'blah' occurred", fdist['blah'], "times")
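To see what the lowercasing buys you, here is a sketch that counts the same sentence both ways; a simple regex tokenizer stands in for `nltk.word_tokenize` so it runs without NLTK installed:

```python
import re
from collections import Counter

doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"

def tokenize(text):
    # Stand-in for nltk.word_tokenize: grab runs of word characters.
    return re.findall(r"\w+", text)

# Case-sensitive: 'Blah' and 'blah' are separate keys.
cased = Counter(tokenize(doc))
print(cased['blah'], cased['Blah'])   # 3 1

# Lowercasing first merges them into one key.
lowered = Counter(tokenize(doc.lower()))
print(lowered['blah'])                # 4
print(lowered.most_common(1))         # [('blah', 4)]
```

Once the tokens are in a Counter (or FreqDist, which subclasses Counter in NLTK 3), `most_common(n)` gives you the n most frequent tokens for free.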