
I'm unsure whether I've understood correctly how the FreqDist class works in NLTK. The tutorial I am following leads me to believe that the following code constructs a frequency distribution for a given list of words and returns the top x most frequently used words. (In the example below, let corpus be an NLTK corpus and 'file.txt' be the name of a file in that corpus.)

words = corpus.words('file.txt')
fd_words = nltk.FreqDist(word.lower() for word in words)
fd_words.items()[:x]

However, when I run the following commands in Python, the result seems to suggest otherwise:

>>> from nltk import *
>>> fdist = FreqDist(['hi','my','name','is','my','name'])
>>> fdist
FreqDist({'my': 2, 'name': 2, 'is': 1, 'hi': 1})
>>> fdist.items()
[('is',1),('hi',1),('my',2),('name',2)]
>>> fdist.items()[:2]
[('is',1),('hi',1)]

Is fdist.items()[:x] in fact returning the x least common words?

Can someone tell me if I have done something wrong or if the mistake lies in the tutorial I am following?

Wolff
  • You may get some help [from answers here](http://stackoverflow.com/questions/23042699/freqdist-in-nltk-not-sorting-output). Essentially `.items()` is using the stdlib implementation, so it's not sorted. If you want the x most frequent words, use `fdist.most_common(x)`. – MrAlexBailey Jan 29 '16 at 14:11
  • Note that the sorting behavior of `FreqDist` has changed in NLTK 3. This may explain the confusion. Also: Use `fd_words.most_common()`, without an argument, to get everything in descending frequency order. – alexis Jan 31 '16 at 11:09
  • or you could do something pretty as shown here https://plot.ly/python/table/ – Dexter Mar 21 '17 at 15:58

1 Answer


By default, a FreqDist is not sorted. I think you are looking for the most_common method:

from nltk import FreqDist
fdist = FreqDist(['hi','my','name','is','my','name'])
fdist.most_common(2)

Returns:

[('my', 2), ('name', 2)]
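
To make the difference between the two approaches concrete, here is a minimal sketch that uses `collections.Counter` in place of `FreqDist`, on the assumption that you are on NLTK 3, where `FreqDist` is a `Counter` subclass; that way it runs without NLTK installed. The names `counts`, `top_two` and `by_freq` are only illustrative. Slicing `items()` gives you whatever order the container happens to iterate in, whereas `most_common` or an explicit sort orders by count:

```python
from collections import Counter

words = ['hi', 'my', 'name', 'is', 'my', 'name']

# Counter stands in for nltk.FreqDist here (assumption: NLTK 3,
# where FreqDist inherits most_common() from Counter).
counts = Counter(word.lower() for word in words)

# most_common(n) returns the n entries with the highest counts.
top_two = counts.most_common(2)
print(top_two)

# Equivalent explicit sort; it does not rely on the iteration
# order of items(), which is not ordered by frequency.
by_freq = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(by_freq[:2])
```

On Python 3, `items()` returns a view rather than a list, so `fdist.items()[:x]` from the tutorial would raise a `TypeError` there; wrapping it in `list(...)` avoids the error but still does not sort by frequency.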
Jerzy Pawlikowski
  • `Counter(['hi','my','name','is','my','name']).most_common()` would do the same too ;P. See this: http://stackoverflow.com/questions/34603922/difference-between-pythons-collections-counter-and-nltk-probability-freqdist/34606637#34606637 – alvas Jan 29 '16 at 14:56