Create a table displaying relative frequencies using python NLTK and iterating over 18 texts from Gutenberg Corpus

Question

I need to create a table displaying relative frequencies with which “modals” (can, could, may, might, will, would and should) are used in 18 texts provided by NLTK in the extract from Gutenberg Corpus.

Here is my Code

for fileid in gutenberg.fileids():
    fdist = nltk.FreqDist(for w in gutenberg.words(fileid))
modals = ['can', 'could', 'may', 'might', 'must', 'will','would','should']

I need to tabulate fdist with fileid as " Rows" and modals as "Columns"

Based on the information you provide in your question, it's clear that your code has a bug. You should fix it. — alexis, Mar 20 '17 at 20:25
Seriously, welcome to stackoverflow. Please see the help section for guidance on how to write a good question. (Briefly, you must explain your goal clearly, and show the (relevant!) code you have managed so far.) As it stands, your question doesn't provide enough information for anyone to help you. — alexis, Mar 20 '17 at 20:26
Your code isn't even valid Python (even after I fixed the indentation). If this is really the best you can do, you should start by reading a few chapters from the nltk book (and/or your textbook, if different.) — alexis, Mar 21 '17 at 11:42

score 0 · Answer 1 · edited May 23 '17 at 12:17

TL;DR

Most probably this is what you need:

[out]:

austen-emma.txt Counter({u'could': 825, u'would': 815, u'must': 564, u'will': 559, u'should': 366, u'might': 322, u'can': 270, u'may': 213})
austen-persuasion.txt Counter({u'could': 444, u'would': 351, u'must': 228, u'should': 185, u'might': 166, u'will': 162, u'can': 100, u'may': 87})
austen-sense.txt Counter({u'could': 568, u'would': 507, u'will': 354, u'must': 279, u'should': 228, u'might': 215, u'can': 206, u'may': 169})
bible-kjv.txt Counter({u'will': 3807, u'may': 1024, u'should': 768, u'might': 475, u'would': 443, u'can': 213, u'could': 165, u'must': 131})
blake-poems.txt Counter({u'can': 20, u'should': 6, u'may': 5, u'would': 3, u'could': 3, u'will': 3, u'might': 2, u'must': 2})
bryant-stories.txt Counter({u'could': 154, u'will': 144, u'would': 110, u'can': 75, u'must': 39, u'should': 38, u'might': 23, u'may': 18})
burgess-busterbrown.txt Counter({u'could': 56, u'would': 46, u'can': 23, u'will': 19, u'might': 17, u'must': 14, u'should': 13, u'may': 3})
carroll-alice.txt Counter({u'could': 73, u'would': 70, u'can': 57, u'must': 41, u'might': 28, u'should': 27, u'will': 24, u'may': 11})
chesterton-ball.txt Counter({u'will': 198, u'would': 139, u'can': 131, u'could': 117, u'may': 90, u'must': 81, u'should': 75, u'might': 69})
chesterton-brown.txt Counter({u'could': 170, u'would': 132, u'can': 126, u'will': 111, u'might': 71, u'must': 70, u'should': 56, u'may': 47})
chesterton-thursday.txt Counter({u'could': 148, u'can': 117, u'would': 116, u'will': 109, u'might': 71, u'may': 56, u'should': 54, u'must': 48})
edgeworth-parents.txt Counter({u'will': 517, u'would': 503, u'could': 420, u'can': 340, u'should': 271, u'must': 250, u'may': 160, u'might': 127})
melville-moby_dick.txt Counter({u'would': 421, u'will': 379, u'must': 282, u'may': 230, u'can': 220, u'could': 215, u'might': 183, u'should': 181})
milton-paradise.txt Counter({u'will': 161, u'may': 116, u'can': 107, u'might': 98, u'must': 66, u'could': 62, u'should': 55, u'would': 49})
shakespeare-caesar.txt Counter({u'will': 129, u'would': 40, u'should': 38, u'may': 35, u'must': 30, u'could': 18, u'can': 16, u'might': 12})
shakespeare-hamlet.txt Counter({u'will': 131, u'would': 60, u'may': 56, u'must': 53, u'should': 52, u'can': 33, u'might': 28, u'could': 26})
shakespeare-macbeth.txt Counter({u'will': 62, u'would': 42, u'should': 41, u'must': 33, u'may': 30, u'can': 21, u'could': 15, u'might': 5})
whitman-leaves.txt Counter({u'will': 261, u'can': 88, u'would': 85, u'may': 85, u'must': 63, u'could': 49, u'should': 42, u'might': 26})
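
Output of this shape comes from counting only the modal tokens in each file. Here is a minimal sketch of that idea, assuming a case-sensitive match on the token; `count_modals` and `toy_words` are hypothetical names used for illustration, and the toy list stands in for `gutenberg.words(fileid)` so the snippet runs without the corpus installed:

```python
from collections import Counter

MODALS = {'can', 'could', 'may', 'might', 'must', 'will', 'would', 'should'}

def count_modals(words, modals=MODALS):
    """Count only the tokens that appear in `modals` (case-sensitive)."""
    return Counter(word for word in words if word in modals)

# Toy tokens standing in for gutenberg.words(fileid) (an assumption for the demo).
toy_words = ['I', 'could', 'not', 'say', 'what', 'I', 'would', 'do', ',',
             'but', 'I', 'could', 'try', '.']
toy_counts = count_modals(toy_words)
print(toy_counts)

# With NLTK and the Gutenberg corpus available, the same helper could be
# applied per file, roughly like this:
# from nltk.corpus import gutenberg
# for fileid in gutenberg.fileids():
#     print(fileid, count_modals(gutenberg.words(fileid)))
```

Using a set for the modals keeps the membership test cheap, which matters when the generator walks every token of a long text.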

And to put them in a table:

fileids                   would    may  could should   will    can  might   must
austen-emma.txt             815    213    825    366    559    270    322    564
austen-persuasion.txt       351     87    444    185    162    100    166    228
austen-sense.txt            507    169    568    228    354    206    215    279
bible-kjv.txt               443   1024    165    768   3807    213    475    131
blake-poems.txt               3      5      3      6      3     20      2      2
bryant-stories.txt          110     18    154     38    144     75     23     39
burgess-busterbrown.txt      46      3     56     13     19     23     17     14
carroll-alice.txt            70     11     73     27     24     57     28     41
chesterton-ball.txt         139     90    117     75    198    131     69     81
chesterton-brown.txt        132     47    170     56    111    126     71     70
chesterton-thursday.txt     116     56    148     54    109    117     71     48
edgeworth-parents.txt       503    160    420    271    517    340    127    250
melville-moby_dick.txt      421    230    215    181    379    220    183    282
milton-paradise.txt          49    116     62     55    161    107     98     66
shakespeare-caesar.txt       40     35     18     38    129     16     12     30
shakespeare-hamlet.txt       60     56     26     52    131     33     28     53
shakespeare-macbeth.txt      42     30     15     41     62     21      5     33
whitman-leaves.txt           85     85     49     42    261     88     26     63

In long:

First, let's look at how FreqDist works; see also the question "Difference between Python's collections.Counter and nltk.probability.FreqDist".

FreqDist is essentially a subclass of collections.Counter, so we can feed it a list and it counts the occurrences of each item in the list:

>>> from collections import Counter
>>> from nltk import FreqDist

>>> alist = [1,2,1,2,3,4,5,6,7,2,4,5,6,9]

>>> Counter(alist)
Counter({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})

>>> FreqDist(alist)
FreqDist({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})

Now to the gutenberg corpus in nltk. The .words() function returns a list of the words found in the corpus file with the given fileid, e.g.:

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     print(fileid)
...     print(gutenberg.words(fileid))
...     break
... 
austen-emma.txt
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', ...]

So we can pass that list to FreqDist to count the words in austen-emma.txt.

Now, to filter the words in the FreqDist, there are 2 strategies:

1. Count all the words in the file and then look up the counts of the modal words you're interested in.
2. Count only the modal words and ignore the other words when initializing the Counter object.

E.g. let's say our words are numbers and we are interested only in 1,2,8:

>>> words = [1,1,2,3,2,3,4,5,6,7,8,2,5]
>>> Counter(words)
Counter({2: 3, 1: 2, 3: 2, 5: 2, 4: 1, 6: 1, 7: 1, 8: 1})
>>> interested_words = [1,2,8]
>>> counted = Counter(words)
>>> counted[1]
2
>>> counted[2]
3
>>> counted[8]
1

The alternative is to count only those words. We can use a list comprehension to filter the words first, e.g.:

>>> filtered_words = [word for word in words if word in interested_words]
>>> Counter(filtered_words)
Counter({2: 3, 1: 2, 8: 1})

See http://www.pythonforbeginners.com/basics/list-comprehensions-in-python
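
Since the question asks for relative frequencies rather than raw counts, note that the first strategy keeps the total number of tokens available, so each count can be divided by it. A minimal sketch on the toy numbers above; `relative_frequencies` is a hypothetical helper, not part of NLTK (FreqDist itself offers `.N()` for the total and `.freq(sample)` for the ratio, if you prefer to stay inside NLTK):

```python
from collections import Counter

def relative_frequencies(words, targets):
    """Map each target to count(target) / total number of tokens in `words`."""
    counted = Counter(words)
    total = sum(counted.values())
    if total == 0:
        # Avoid dividing by zero on an empty text.
        return {target: 0.0 for target in targets}
    # float() keeps the division exact under Python 2 as well as Python 3.
    return {target: counted[target] / float(total) for target in targets}

words = [1, 1, 2, 3, 2, 3, 4, 5, 6, 7, 8, 2, 5]
interested_words = [1, 2, 8]
rel = relative_frequencies(words, interested_words)
for target in interested_words:
    print(target, round(rel[target], 3))
```

If you filter before counting (the second strategy), the total of the filtered Counter is only the number of modal tokens, so the denominator has to come from the unfiltered word list.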

To the tabulating part of the question, now we'll see why FreqDist is a fancy but useful object.

The .tabulate() function puts the keys in the FreqDist in the first row and the values (i.e. the counts) in the second row, e.g.:

>>> FreqDist(filtered_words)
FreqDist({2: 3, 1: 2, 8: 1})
>>> FreqDist(filtered_words).tabulate()
2 1 8 
3 2 1

Unfortunately, .tabulate() offers little control over how the table is laid out. So you would have to write your own printing code if you need things like having the fileids in the first column.

So, if you have the FreqDist object for one row and would like to print its counts as a tab-separated string, you could do this:

>>> fd = FreqDist(filtered_words)
>>> print('\t'.join(map(str, [fd[word] for word in interested_words])))
2   3   1

Let's say you need to add the rowid to the first column, you could do this:

>>> row_name = 'blahblah'
>>> print('\t'.join(map(str, [row_name] + [fd[word] for word in interested_words])))
blahblah    2   3   1

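So if you have multiple rows, wrap the row formatting in a small helper and call it once per row:

>>> def print_row(rowid, fd):
...     return '\t'.join(map(str, [rowid] + [fd[word] for word in interested_words]))
... 
>>> rowid_values = [('row1', FreqDist({2: 3, 1: 2, 8: 1})) , ('row2', FreqDist({2: 10, 1: 20, 8: 10})) ]
>>> for rowid, _fd in rowid_values:
...     print(print_row(rowid, _fd))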
... 
row1    2   3   1
row2    20  10  10

And if you need the header row, you can also print it out:

>>> list(map(str, interested_words))
['1', '2', '8']
>>> ['rowids'] + list(map(str, interested_words))
['rowids', '1', '2', '8']
>>> '\t'.join(['rowids'] + list(map(str, interested_words)))
'rowids\t1\t2\t8'
>>> print('\t'.join(['rowids'] + list(map(str, interested_words))))
rowids  1   2   8

To join them up:

>>> print('\t'.join(['rowids'] + list(map(str, interested_words)))); print('\n'.join([print_row(rowid, _fd) for rowid, _fd in rowid_values]))
rowids  1   2   8
row1    2   3   1
row2    20  10  10
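
The header and row steps can also be folded into one function. A minimal sketch, assuming each row is a (row id, Counter) pair; `format_table` is a hypothetical helper name, and plain Counter objects stand in for FreqDist here so the snippet runs without NLTK:

```python
from collections import Counter

def format_table(rows, columns, row_header='fileids'):
    """Return a tab-separated table: a header line, then one line per row."""
    lines = ['\t'.join([row_header] + [str(col) for col in columns])]
    for row_id, counts in rows:
        # A Counter returns 0 for missing keys, so absent columns print as 0.
        cells = [str(counts[col]) for col in columns]
        lines.append('\t'.join([str(row_id)] + cells))
    return '\n'.join(lines)

rows = [('row1', Counter({2: 3, 1: 2, 8: 1})),
        ('row2', Counter({2: 10, 1: 20, 8: 10}))]
table = format_table(rows, [1, 2, 8], row_header='rowids')
print(table)
```

For the actual task, the rows would be (fileid, counts) pairs built per Gutenberg file and the columns would be the list of modals.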
