How to tag an article with NLTK?

Question

I'm trying to understand how to tag an article via the NLTK. According to bullet point 2.7 explains how to find the most frequent nouns.

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())


>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print(tag, tagdict[tag])

But this is really confusing as I don't see how a piece of text is being injected in to find the tags for it. Rather it is using a predefined structure of data (nltk.corpus.brown.tagged_words). Not sure how to proceed here.

score 1 · Accepted Answer · edited May 23 '17 at 12:06

In short:

To tag a text, use nltk.pos_tag but do note its quirks (Python NLTK pos_tag not returning the correct part-of-speech tag):

>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "This is a foo bar piece of text. And there are many sentences in this text."
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
>>> tagged_text
[[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('piece', 'NN'), ('of', 'IN'), ('text', 'NN'), ('.', '.')], [('And', 'CC'), ('there', 'EX'), ('are', 'VBP'), ('many', 'JJ'), ('sentences', 'NNS'), ('in', 'IN'), ('this', 'DT'), ('text', 'NN'), ('.', '.')]]

In long:

A list of NLTK datasets can be found by:

>>> import nltk
>>> nltk.download()

And nltk.corpus.brown corpus is one of the most commonly used corpus for Natural Language Processing or text processing (see What is the difference between corpus and lexicon in NLTK (python) for jargons).

In the case of brown corpus, it is a fully tagged and tokenized corpus so all NLTK provided was a reader. To access the various annotations, see section 1.3 at http://www.nltk.org/howto/corpus.html, here's a few examples:

>>> from nltk.corpus import brown
>>> brown.words()[:50]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.', u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise']
>>> brown.tagged_words()[:50]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.'), (u'The', u'AT'), (u'jury', u'NN'), (u'further', u'RBR'), (u'said', u'VBD'), (u'in', u'IN'), (u'term-end', u'NN'), (u'presentments', u'NNS'), (u'that', u'CS'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'Executive', u'JJ-TL'), (u'Committee', u'NN-TL'), (u',', u','), (u'which', u'WDT'), (u'had', u'HVD'), (u'over-all', u'JJ'), (u'charge', u'NN'), (u'of', u'IN'), (u'the', u'AT'), (u'election', u'NN'), (u',', u','), (u'``', u'``'), (u'deserves', u'VBZ'), (u'the', u'AT'), (u'praise', u'NN')]
>>> brown.sents()
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]
>>> brown.sents()[:3]
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], [u'The', u'September-October', u'term', u'jury', u'had', u'been', u'charged', u'by', u'Fulton', u'Superior', u'Court', u'Judge', u'Durwood', u'Pye', u'to', u'investigate', u'reports', u'of', u'possible', u'``', u'irregularities', u"''", u'in', u'the', u'hard-fought', u'primary', u'which', u'was', u'won', u'by', u'Mayor-nominate', u'Ivan', u'Allen', u'Jr.', u'.']]
>>> brown.tagged_sents()[:3]
[[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')], [(u'The', u'AT'), (u'jury', u'NN'), (u'further', u'RBR'), (u'said', u'VBD'), (u'in', u'IN'), (u'term-end', u'NN'), (u'presentments', u'NNS'), (u'that', u'CS'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'Executive', u'JJ-TL'), (u'Committee', u'NN-TL'), (u',', u','), (u'which', u'WDT'), (u'had', u'HVD'), (u'over-all', u'JJ'), (u'charge', u'NN'), (u'of', u'IN'), (u'the', u'AT'), (u'election', u'NN'), (u',', u','), (u'``', u'``'), (u'deserves', u'VBZ'), (u'the', u'AT'), (u'praise', u'NN'), (u'and', u'CC'), (u'thanks', u'NNS'), (u'of', u'IN'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'of', u'IN-TL'), (u'Atlanta', u'NP-TL'), (u"''", u"''"), (u'for', u'IN'), (u'the', u'AT'), (u'manner', u'NN'), (u'in', u'IN'), (u'which', u'WDT'), (u'the', u'AT'), (u'election', u'NN'), (u'was', u'BEDZ'), (u'conducted', u'VBN'), (u'.', u'.')], [(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]]

The structure for:

nltk.corpus.brown.words() is a list of string, where each item in the list is a word
nltk.corpus.brown.tagged_words() is a list of tuples with the first element as the word and the 2nd element in the tuple as the tag
nltk.corpus.sents() is a list of a list of string, where the other list comprises the whole corpus and the inner list is one sentence
nltk.corpus.tagged_sents() is a list of list of tuples, where it's the same as nltk.corpus.sents() but the inner list is a tuple of word and tag.

Thank you so much for the great answer. So Corpus is not just a sample file, it is actually needed to process the tagging? Looking at the nltk folder, even after deleting all zip files, the `corpus` folder alone is 1.4 gb. Is that all needed? Or could I just keep `brown` for best results? How about `models` folder? Is that needed too? 120mb — Houman, Aug 06 '15 at 08:09
Depends on the task, if you only need the brown corpus, you could do `import nltk; nltk.download('brown')`. The other files in the `nltk_data` folder are used for other functions. But most probability, you'll need `punkt` for the `nltk.sent_tokenize` and also `maxent_treebank_pos_tagger` for `nltk.pos_tag` and possibly someday you'll need `universal_tagset` and `snowball_data`. — alvas, Aug 06 '15 at 08:17
But if you like the idea of "batteries included", then just download all the models. As for corpus, it's also very useful if you're interested in using them as training data. I usually do `nltk.download('all')` so that I can get everything =) — alvas, Aug 06 '15 at 08:18
Corpora (plural for corpus) in NLTK are already pre-process and zipped. NLTK provide the API to read them easily. — alvas, Aug 06 '15 at 08:19
Amazing, thanks for the info. As I'm using the code on GoogleAppEngine, Hence there are some limitations and I need to prepare all Libs and data upfront I wish to upload. So when you say they are zipped and NLTK can read them, how about performance? — Houman, Aug 06 '15 at 09:04

How to tag an article with NLTK?

1 Answers1