TL;DR
Most probably this is what you need, i.e. the count of each modal verb in every file of the Gutenberg corpus:
[out]:
austen-emma.txt Counter({u'could': 825, u'would': 815, u'must': 564, u'will': 559, u'should': 366, u'might': 322, u'can': 270, u'may': 213})
austen-persuasion.txt Counter({u'could': 444, u'would': 351, u'must': 228, u'should': 185, u'might': 166, u'will': 162, u'can': 100, u'may': 87})
austen-sense.txt Counter({u'could': 568, u'would': 507, u'will': 354, u'must': 279, u'should': 228, u'might': 215, u'can': 206, u'may': 169})
bible-kjv.txt Counter({u'will': 3807, u'may': 1024, u'should': 768, u'might': 475, u'would': 443, u'can': 213, u'could': 165, u'must': 131})
blake-poems.txt Counter({u'can': 20, u'should': 6, u'may': 5, u'would': 3, u'could': 3, u'will': 3, u'might': 2, u'must': 2})
bryant-stories.txt Counter({u'could': 154, u'will': 144, u'would': 110, u'can': 75, u'must': 39, u'should': 38, u'might': 23, u'may': 18})
burgess-busterbrown.txt Counter({u'could': 56, u'would': 46, u'can': 23, u'will': 19, u'might': 17, u'must': 14, u'should': 13, u'may': 3})
carroll-alice.txt Counter({u'could': 73, u'would': 70, u'can': 57, u'must': 41, u'might': 28, u'should': 27, u'will': 24, u'may': 11})
chesterton-ball.txt Counter({u'will': 198, u'would': 139, u'can': 131, u'could': 117, u'may': 90, u'must': 81, u'should': 75, u'might': 69})
chesterton-brown.txt Counter({u'could': 170, u'would': 132, u'can': 126, u'will': 111, u'might': 71, u'must': 70, u'should': 56, u'may': 47})
chesterton-thursday.txt Counter({u'could': 148, u'can': 117, u'would': 116, u'will': 109, u'might': 71, u'may': 56, u'should': 54, u'must': 48})
edgeworth-parents.txt Counter({u'will': 517, u'would': 503, u'could': 420, u'can': 340, u'should': 271, u'must': 250, u'may': 160, u'might': 127})
melville-moby_dick.txt Counter({u'would': 421, u'will': 379, u'must': 282, u'may': 230, u'can': 220, u'could': 215, u'might': 183, u'should': 181})
milton-paradise.txt Counter({u'will': 161, u'may': 116, u'can': 107, u'might': 98, u'must': 66, u'could': 62, u'should': 55, u'would': 49})
shakespeare-caesar.txt Counter({u'will': 129, u'would': 40, u'should': 38, u'may': 35, u'must': 30, u'could': 18, u'can': 16, u'might': 12})
shakespeare-hamlet.txt Counter({u'will': 131, u'would': 60, u'may': 56, u'must': 53, u'should': 52, u'can': 33, u'might': 28, u'could': 26})
shakespeare-macbeth.txt Counter({u'will': 62, u'would': 42, u'should': 41, u'must': 33, u'may': 30, u'can': 21, u'could': 15, u'might': 5})
whitman-leaves.txt Counter({u'will': 261, u'can': 88, u'would': 85, u'may': 85, u'must': 63, u'could': 49, u'should': 42, u'might': 26})
And to put them in a table:
fileids                   would    may  could should   will    can  might   must
austen-emma.txt             815    213    825    366    559    270    322    564
austen-persuasion.txt       351     87    444    185    162    100    166    228
austen-sense.txt            507    169    568    228    354    206    215    279
bible-kjv.txt               443   1024    165    768   3807    213    475    131
blake-poems.txt               3      5      3      6      3     20      2      2
bryant-stories.txt          110     18    154     38    144     75     23     39
burgess-busterbrown.txt      46      3     56     13     19     23     17     14
carroll-alice.txt            70     11     73     27     24     57     28     41
chesterton-ball.txt         139     90    117     75    198    131     69     81
chesterton-brown.txt        132     47    170     56    111    126     71     70
chesterton-thursday.txt     116     56    148     54    109    117     71     48
edgeworth-parents.txt       503    160    420    271    517    340    127    250
melville-moby_dick.txt      421    230    215    181    379    220    183    282
milton-paradise.txt          49    116     62     55    161    107     98     66
shakespeare-caesar.txt       40     35     18     38    129     16     12     30
shakespeare-hamlet.txt       60     56     26     52    131     33     28     53
shakespeare-macbeth.txt      42     30     15     41     62     21      5     33
whitman-leaves.txt           85     85     49     42    261     88     26     63
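Counts and a table like the ones above can be produced along these lines. This is only a minimal sketch, written for Python 3 and kept to the standard library so that it runs without downloading anything. The `corpus` dictionary is a made-up stand-in for the NLTK corpus, and `count_modals` and `format_table` are illustrative names, not NLTK functions. With NLTK installed and the Gutenberg corpus downloaded, you would presumably build `corpus` from `gutenberg.fileids()` and `gutenberg.words(fileid)` instead.

```python
from collections import Counter

MODALS = ['would', 'may', 'could', 'should', 'will', 'can', 'might', 'must']

def count_modals(words, modals=MODALS):
    """Count only the words that appear in `modals` (case-sensitive)."""
    wanted = set(modals)
    return Counter(word for word in words if word in wanted)

def format_table(counts_by_file, modals=MODALS):
    """Return a tab-separated table: a header row, then one row per file."""
    lines = ['\t'.join(['fileids'] + list(modals))]
    for fileid, counts in counts_by_file.items():
        # A Counter returns 0 for missing keys, so absent modals print as 0.
        lines.append('\t'.join([fileid] + [str(counts[m]) for m in modals]))
    return '\n'.join(lines)

# Hypothetical stand-in for {fileid: gutenberg.words(fileid)}.
corpus = {
    'a.txt': ['I', 'would', 'if', 'I', 'could', ',', 'but', 'I', 'will', 'not'],
    'b.txt': ['You', 'may', 'go', ';', 'you', 'must', 'go', ';', 'you', 'may'],
}
counts_by_file = {fileid: count_modals(words) for fileid, words in corpus.items()}
table = format_table(counts_by_file)
print(table)
```

The rest of this answer walks through the same steps one at a time.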
In long:
First, let's look at how FreqDist works (see also "Difference between Python's collections.Counter and nltk.probability.FreqDist"). FreqDist is basically a collections.Counter object, so we can feed it a list and it counts the instances in the list:
>>> from collections import Counter
>>> from nltk import FreqDist
>>> alist = [1,2,1,2,3,4,5,6,7,2,4,5,6,9]
>>> Counter(alist)
Counter({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})
>>> FreqDist(alist)
FreqDist({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})
Now to the gutenberg corpus in nltk. The .words() function returns a list of the words found in the corpus for the given filename, e.g.:
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
... print fileid
... print gutenberg.words(fileid)
... break
...
austen-emma.txt
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', ...]
So we can use the FreqDist initialization to count the words in austen-emma.txt.
Now, to filter the words in the FreqDist, there are 2 strategies:
1. Count all the words in the files, then look up the counts of the modal words you're interested in.
2. Count only the modal words and ignore the other words when initializing the Counter object.
E.g. let's say our words are numbers and we are interested only in 1, 2 and 8:
>>> words = [1,1,2,3,2,3,4,5,6,7,8,2,5]
>>> Counter(words)
Counter({2: 3, 1: 2, 3: 2, 5: 2, 4: 1, 6: 1, 7: 1, 8: 1})
>>> interested_words = [1,2,8]
>>> counted = Counter(words)
>>> counted[1]
2
>>> counted[2]
3
>>> counted[8]
1
The alternative is to count only those words. We can use a list comprehension to filter the words, e.g.:
>>> filtered_words = [word for word in words if word in interested_words]
>>> Counter(filtered_words)
Counter({2: 3, 1: 2, 8: 1})
For more on list comprehensions, see http://www.pythonforbeginners.com/basics/list-comprehensions-in-python
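When the words come from a whole book, the membership test runs once per token, and on a list each test scans the list. A small variation, sketched here for Python 3 with the same toy data, keeps the targets in a set and feeds a generator straight to Counter so that no intermediate list is built. It also checks that both strategies give the same counts for the targets.

```python
from collections import Counter

words = [1, 1, 2, 3, 2, 3, 4, 5, 6, 7, 8, 2, 5]
interested_words = {1, 2, 8}  # a set makes each membership test O(1) on average

# Strategy 1: count everything, then look up the targets.
counted = Counter(words)
from_full_count = {w: counted[w] for w in interested_words}

# Strategy 2: count only the targets, without building a filtered list.
filtered_count = Counter(w for w in words if w in interested_words)

# Every target occurs at least once here, so the two results match.
print(from_full_count == dict(filtered_count))
```

If a target never occurs, the first approach still stores 0 for it, while the filtered Counter has no entry for it (indexing it returns 0 all the same).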
Now to the tabulating part of the question, where we'll see why FreqDist is a fancy but useful object. The .tabulate() function puts the keys of the FreqDist in the first row and the values (i.e. the counts) in the second row, e.g.:
>>> FreqDist(filtered_words)
FreqDist({2: 3, 1: 2, 8: 1})
>>> FreqDist(filtered_words).tabulate()
2 1 8
3 2 1
Unfortunately, .tabulate() gives you little control over how the table is laid out, so you would have to write your own printing code if you need things like having the fileids in the first column.
Let's say you have a row from the FreqDist object and you would like to print it as a tab-separated string; you could do this:
>>> fd = FreqDist(filtered_words)
>>> print '\t'.join(map(str, [fd[word] for word in interested_words]))
2 3 1
If you need the row id in the first column, you could do this:
>>> row_name = 'blahblah'
>>> print '\t'.join(map(str, [row_name] + [fd[word] for word in interested_words]))
blahblah 2 3 1
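So if you have multiple rows, wrap the row formatting in a small function first:
>>> def print_row(rowid, fd):
...     return '\t'.join(map(str, [rowid] + [fd[word] for word in interested_words]))
...
>>> rowid_values = [('row1', FreqDist({2: 3, 1: 2, 8: 1})) , ('row2', FreqDist({2: 10, 1: 20, 8: 10})) ]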
>>> for rowid, _fd in rowid_values:
... print print_row(rowid, _fd)
...
row1 2 3 1
row2 20 10 10
And if you need the header row, you can also print it out:
>>> map(str, interested_words)
['1', '2', '8']
>>> ['rowids'] + map(str, interested_words)
['rowids', '1', '2', '8']
>>> '\t'.join(['rowids'] + map(str, interested_words))
'rowids\t1\t2\t8'
>>> print '\t'.join(['rowids'] + map(str, interested_words))
rowids 1 2 8
To join them up:
>>> print '\t'.join(['rowids'] + map(str, interested_words)); print '\n'.join([print_row(rowid, _fd) for rowid, _fd in rowid_values])
rowids 1 2 8
row1 2 3 1
row2 20 10 10
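If tab-separated output is not enough, the same pieces can be wrapped into a helper that pads each column to a common width. The following is only a sketch for Python 3. `format_aligned` is a made-up helper, not part of NLTK, and it assumes each row is a (label, counts) pair whose counts object supports `counts[key]`, which holds for both Counter and FreqDist. Counter is used here so the example runs without NLTK.

```python
from collections import Counter

def format_aligned(rows, columns, label_header='rowids'):
    """Return a table with row labels in the first column, padded to align."""
    table = [[label_header] + [str(c) for c in columns]]
    for label, counts in rows:
        table.append([label] + [str(counts[c]) for c in columns])
    # The width of each column is the width of its longest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*table)]
    lines = []
    for cells in table:
        first = cells[0].ljust(widths[0])  # labels are left-aligned
        # Counts are right-aligned so the digits line up.
        rest = [cell.rjust(width) for cell, width in zip(cells[1:], widths[1:])]
        lines.append('  '.join([first] + rest))
    return '\n'.join(lines)

rows = [('row1', Counter({2: 3, 1: 2, 8: 1})),
        ('row2', Counter({2: 10, 1: 20, 8: 10}))]
print(format_aligned(rows, columns=[1, 2, 8]))
```

Returning a string instead of printing inside the helper keeps it easy to write the table to a file or to compare it in a test.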