
I am playing around with NLTK and its FreqDist class:

import nltk
from nltk.corpus import gutenberg
print(gutenberg.fileids())
from nltk import FreqDist
fd = FreqDist()

for word in gutenberg.words('austen-persuasion.txt'):
    fd[word] += 1

newfd = sorted(fd, key=fd.get, reverse=True)[:10]

I have a question about the sorting step. When I run the code as shown, it sorts the FreqDist entries as expected. However, when I pass get() instead of get, I get this error:

Traceback (most recent call last):
  File "C:\Python34\NLP\NLP.py", line 21, in <module>
    newfd = sorted(fd, key=fd.get(), reverse=True)[:10]
TypeError: get expected at least 1 arguments, got 0

Why is get right and get() wrong? I was under the impression that get() should be correct, but apparently it is not.

alvas
Bao Dinh
  • Most probably what you need is `fd.most_common()`. Essentially, FreqDist in NLTK is a collections.Counter, see http://stackoverflow.com/questions/34603922/difference-between-pythons-collections-counter-and-nltk-probability-freqdist/34606637#34606637 – alvas May 25 '16 at 06:15

1 Answer

Essentially, the FreqDist object in NLTK is a subclass of collections.Counter from Python's standard library, so let's see how Counter works:

A Counter is a dictionary that stores the distinct elements of an iterable (here, a list) as its keys and the counts of those elements as its values:

>>> from collections import Counter
>>> Counter(['a','a','b','c','c','c','d'])
Counter({'c': 3, 'a': 2, 'b': 1, 'd': 1})
>>> c = Counter(['a','a','b','c','c','c','d'])
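
This dictionary behaviour is also why the counting loop in the question works. A minimal sketch, assuming the same toy list as above (the names `by_loop` and `by_constructor` are only for illustration): a Counter reports 0 for a key it has not seen, so `+= 1` needs no initialization, and the loop ends up with the same counts as passing the list to the constructor.

```python
from collections import Counter

letters = ['a', 'a', 'b', 'c', 'c', 'c', 'd']

# Counting by hand, as the question does with fd[word] += 1.
by_loop = Counter()
for letter in letters:
    by_loop[letter] += 1  # a missing key counts as 0, so no KeyError

# Counting by passing the iterable straight to the constructor.
by_constructor = Counter(letters)

print(by_loop == by_constructor)  # True
print(by_loop['z'])               # 0, an unseen key
```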

To get the elements sorted by frequency, use the .most_common() method; it returns a list of (element, count) tuples sorted from the highest count to the lowest.

>>> c.most_common()
[('c', 3), ('a', 2), ('b', 1), ('d', 1)]

And in reverse:

>>> list(reversed(c.most_common()))
[('d', 1), ('b', 1), ('a', 2), ('c', 3)]

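Like a dictionary, you can iterate through a Counter object and it will return the keys. The output below is from Python 2, where the key order is arbitrary and keys() returns a list; in Python 3.7+ the keys come back in insertion order and keys() returns a view object: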

>>> [key for key in c]
['a', 'c', 'b', 'd']
>>> c.keys()
['a', 'c', 'b', 'd']

You can also use the .items() method to get (key, count) tuples:

>>> c.items()
[('a', 2), ('c', 3), ('b', 1), ('d', 1)]

Alternatively, if you only need the keys sorted by their counts, unzip the result of .most_common() (see "Transpose/Unzip Function (inverse of zip)?"):

>>> k, v = zip(*c.most_common())
>>> k
('c', 'a', 'b', 'd')
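
As a hedged alternative to the unzip idiom, a list comprehension over most_common() gives the same keys. This sketch reuses the toy Counter from above; `keys_by_count` is just an illustrative name. The relative order of the tied keys 'b' and 'd' depends on the Python version, so no full output is shown.

```python
from collections import Counter

c = Counter(['a', 'a', 'b', 'c', 'c', 'c', 'd'])

# Keep only the element from each (element, count) pair.
keys_by_count = [key for key, count in c.most_common()]
print(keys_by_count[:2])  # ['c', 'a']

# Unlike `k, v = zip(*...)`, this also works on an empty Counter,
# where the unpacking form raises ValueError.
empty_keys = [key for key, count in Counter().most_common()]
print(empty_keys)  # []
```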

Going back to the question of .get vs .get(): the former is the method object itself, while the latter is a call to that method, which requires the dictionary key as its argument:

>>> c = Counter(['a','a','b','c','c','c','d'])
>>> c
Counter({'c': 3, 'a': 2, 'b': 1, 'd': 1})
>>> c.get
<built-in method get of Counter object at 0x7f5f95534868>
>>> c.get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: get expected at least 1 arguments, got 0
>>> c.get('a')
2

When calling sorted(), the key=... parameter is not a key of the list/dictionary you're sorting; it is a function that sorted() calls on each element to compute the value to sort by. That is why it needs the uncalled c.get rather than the result of c.get().

So these two are equivalent; both return just the counts for each key:
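
To make the difference concrete, here is a small sketch using the same toy Counter (the names `lookup`, `by_method`, `by_lambda` and `error_message` are illustrative only). It shows that c.get without parentheses is just a callable that sorted() can invoke later, while c.get() runs immediately with no arguments and fails before sorted() is ever reached.

```python
from collections import Counter

c = Counter(['a', 'a', 'b', 'c', 'c', 'c', 'd'])

# No parentheses: a reference to the bound method, nothing is called yet.
lookup = c.get
print(callable(lookup))  # True
print(lookup('c'))       # 3, same as c.get('c')

# sorted() calls the key function once per element, so these are equivalent.
by_method = sorted(c, key=c.get, reverse=True)
by_lambda = sorted(c, key=lambda element: c[element], reverse=True)
print(by_method == by_lambda)  # True

# With parentheses, get() is evaluated first, with no arguments,
# so the TypeError is raised before sorted() runs at all.
try:
    sorted(c, key=c.get())
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```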

>>> [c.get(key) for key in c]
[2, 3, 1, 1]
>>> [c[key] for key in c]
[2, 3, 1, 1]

When sorting, those counts are used as the sort criterion, so each pair below produces the same ordering, except that tied keys (here 'b' and 'd') can come out in a different order:

>>> sorted(c, key=c.get)
['b', 'd', 'a', 'c']
>>> v, k = zip(*sorted((c.get(key), key) for key in c))
>>> list(k)
['b', 'd', 'a', 'c']
>>> sorted(c, key=c.get, reverse=True) # Highest to lowest
['c', 'a', 'b', 'd']
>>> v, k = zip(*reversed(sorted((c.get(key), key) for key in c)))
>>> k
('c', 'a', 'd', 'b')
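
Applied to the question, the top-N lookup might look like the sketch below. It uses a plain Counter and a made-up sentence as a stand-in for gutenberg.words('austen-persuasion.txt'), so it runs without the NLTK corpus; since FreqDist is a Counter subclass, the same calls should apply to `fd` in the question.

```python
from collections import Counter

# Hypothetical stand-in for the corpus words used in the question.
words = "the cat and the dog and the bird".split()

fd = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs.
top = fd.most_common(2)
print(top)  # [('the', 3), ('and', 2)]

# The question's approach yields the same words, without the counts.
same_keys = sorted(fd, key=fd.get, reverse=True)[:2]
print(same_keys)  # ['the', 'and']
```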
alvas
  • Very nice explanation! The c.most_common() method above is my favorite; it goes directly to the point. Note: this is already assumed, but in case some people skip it, you still have to loop first as shown in the question above, then print out c.most_common(). – Calculate Feb 06 '22 at 00:31