How do I find out the most frequent word and word class in a specific brown corpus?

Question

If I am inside the news category,

nltk.corpus.brown.tagged_words(categories="news")

How can I find the most frequent word and the most frequent word class? I am not allowed to use FreqDist, which is what makes this hard.

score 0 · Answer 1 · edited May 23 '17 at 12:09

First, import the name you need rather than spelling out the full module path each time; see https://docs.python.org/3.5/tutorial/modules.html#importing-from-a-package, e.g.:

# We are not Java ;P
# Try not to do nltk.corpus.brown.tagged_words()
# Instead do this:
from nltk.corpus import brown
words_with_tags = brown.tagged_words()

Next, nltk.probability.FreqDist is essentially a subclass of Python's built-in collections.Counter; see Difference between Python's collections.Counter and nltk.probability.FreqDist

If you can't use FreqDist, you can use:

from collections import Counter
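As a minimal sketch of what Counter gives you, here it is on a small hand-made list (the sample_tags values are made up for illustration, not taken from the corpus):

```python
from collections import Counter

# Toy data standing in for a list of tags (an assumption, not corpus output).
sample_tags = ["AT", "NN", "AT", "VBD", "NN", "AT"]

tag_counts = Counter(sample_tags)

# most_common(1) returns a list holding one (item, count) pair.
top_tag, top_count = tag_counts.most_common(1)[0]
print(top_tag, top_count)
```

The same call works on any iterable of hashable items, so it applies to words as well as tags.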

brown.tagged_words() returns a list-like sequence of (word, tag) tuples:

>>> from nltk.corpus import brown
>>> words_with_tags = brown.tagged_words()
>>> words_with_tags[0]
(u'The', u'AT')
>>> words_with_tags[:10]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN')]

To split a list of tuples, see Unpacking a list / tuple of pairs into two lists / tuples:

>>> from nltk.corpus import brown
>>> words_with_tags = brown.tagged_words()
>>> words, tags = zip(*words_with_tags)
>>> words[:10]
(u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of')
>>> tags[:10]
(u'AT', u'NP-TL', u'NN-TL', u'JJ-TL', u'NN-TL', u'VBD', u'NR', u'AT', u'NN', u'IN')

Since this is a homework question, there won't be a full code answer =)

score 0 · Answer 2 · answered Feb 28 '17 at 23:06

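 from collections import Counter
 from nltk.corpus import brown
 tagged = brown.tagged_words(categories="news")
 # Count words and tags separately; flattening the pairs would mix the two.
 word_counts = Counter(word for word, tag in tagged)
 tag_counts = Counter(tag for word, tag in tagged)
 # the most frequent word, then the most frequent word class
 print(word_counts.most_common(1))
 print(tag_counts.most_common(1))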
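If Counter is also off limits, a plain dict can do the counting. The sketch below is only an illustration: the tagged list is a made-up sample in the same (word, tag) shape as brown.tagged_words(), and most_frequent is a hypothetical helper name. Lowercasing the words is an assumption about how "The" and "the" should be treated.

```python
# Hypothetical toy sample shaped like the output of brown.tagged_words().
tagged = [
    ("The", "AT"), ("jury", "NN"), ("said", "VBD"),
    ("the", "AT"), ("jury", "NN"), ("the", "AT"),
]


def most_frequent(items):
    """Return the (item, count) pair with the highest count."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    # On a tie, max() keeps the first item that reached the top count.
    return max(counts.items(), key=lambda pair: pair[1])


# Split the pairs into one tuple of words and one tuple of tags.
words, tags = zip(*tagged)

top_word = most_frequent(w.lower() for w in words)
top_tag = most_frequent(tags)
print(top_word, top_tag)
```

To run it on the corpus, tagged would be replaced with brown.tagged_words(categories="news").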
