Python: How to compute the top X most frequently used words in an NLTK corpus?

Question

I'm not sure I've understood correctly how the FreqDist function works in Python. The tutorial I am following leads me to believe that the following code constructs a frequency distribution for a given list of words and returns the top x most frequently used words. (In the example below, let corpus be an NLTK corpus and file.txt be the name of a file in that corpus.)

words = corpus.words('file.txt')
fd_words = nltk.FreqDist(word.lower() for word in words)
fd_words.items()[:x]

However, when I run the following commands in Python, the result seems to suggest otherwise:

>>> from nltk import *
>>> fdist = FreqDist(['hi','my','name','is','my','name'])
>>> fdist
FreqDist({'my': 2, 'name': 2, 'is': 1, 'hi': 1})
>>> fdist.items()
[('is',1),('hi',1),('my',2),('name',2)]
>>> fdist.items()[:2]
[('is',1),('hi',1)]

So is fdist.items()[:x] in fact returning the x least common words?

Can someone tell me if I have done something wrong or if the mistake lies in the tutorial I am following?

You may get some help [from answers here](http://stackoverflow.com/questions/23042699/freqdist-in-nltk-not-sorting-output). Essentially `.items()` is using the stdlib implementation, so it's not sorted. If you want the x most frequent words use: `fdist.most_common(x)` — MrAlexBailey, Jan 29 '16 at 14:11
Note that the sorting behavior of `FreqDist` has changed in NLTK 3. This may explain the confusion. Also: Use `fd_words.most_common()`, without an argument, to get everything in descending frequency order. — alexis, Jan 31 '16 at 11:09
or you could do something pretty as shown here https://plot.ly/python/table/ — Dexter, Mar 21 '17 at 15:58

score 20 · Answer 1 · answered Jan 29 '16 at 14:32

By default a FreqDist is not sorted. I think you are looking for the most_common method:

from nltk import FreqDist
fdist = FreqDist(['hi','my','name','is','my','name'])
fdist.most_common(2)

Returns:

[('my', 2), ('name', 2)]
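Applied to the question's case (lowercase every word, then take the top x), the same idea might look like the sketch below. It uses `collections.Counter` as a stand-in so that it runs without NLTK or a corpus; in NLTK 3, `FreqDist` subclasses `Counter`, so swapping `nltk.FreqDist` in should behave the same way. The helper name `top_words` and the list `sample_words` are made up for illustration.

```python
from collections import Counter


def top_words(words, x):
    """Return the x most frequent words (case-insensitive) with their counts."""
    # Lowercase each word so that 'My' and 'my' are counted together.
    counts = Counter(word.lower() for word in words)
    # most_common sorts by count, highest first, and keeps the first x pairs.
    return counts.most_common(x)


# Hypothetical stand-in for corpus.words('file.txt')
sample_words = ["Hi", "my", "name", "is", "My", "NAME", "my"]

top_two = top_words(sample_words, 2)
print(top_two)  # [('my', 3), ('name', 2)]
```

With a real corpus, `sample_words` would be replaced by `corpus.words('file.txt')` as in the question.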

Jerzy Pawlikowski

`Counter(['hi','my','name','is','my','name']).most_common()` would do the same too ;P. See this: http://stackoverflow.com/questions/34603922/difference-between-pythons-collections-counter-and-nltk-probability-freqdist/34606637#34606637 – alvas Jan 29 '16 at 14:56
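To make the comparison with `Counter` concrete, here is a small sketch, assuming Python 3.7 or later (where dictionaries keep insertion order). It shows that slicing the items reflects the order in which words were first seen, not their frequency, and that sorting the items by count by hand gives the same result as `most_common`. The variable names are illustrative only.

```python
from collections import Counter

counts = Counter(['hi', 'my', 'name', 'is', 'my', 'name'])

# items() is an unsorted view in Python 3; it must be turned into a list
# before slicing, and the slice follows insertion order, not frequency.
first_two_items = list(counts.items())[:2]

# Sorting the (word, count) pairs by count, highest first, is what
# most_common does for us.
by_count = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
top_two = counts.most_common(2)

print(first_two_items)  # [('hi', 1), ('my', 2)]
print(by_count[:2])     # [('my', 2), ('name', 2)]
print(top_two)          # [('my', 2), ('name', 2)]
```

Words with equal counts come out in the order they were first seen, so the relative order of 'my' and 'name' depends on the input.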
