33

When I chunk text, I get lots of codes in the output, like NN, VBD, IN, DT, NNS, RB. Is there a documented list somewhere that tells me what these mean? I have tried googling "nltk chunk code", "nltk chunk grammar", and "nltk chunk tokens".

However, I have not been able to find any documentation that explains what these codes mean.

alvas
Knows Not Much

4 Answers

26

The tags that you see are not a result of the chunking but of the POS tagging that happens before chunking. They come from the Penn Treebank tagset; see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> sent = "This is a Foo Bar sentence."
>>> # POS-tag the tokenized sentence.
>>> tagged_sent = pos_tag(word_tokenize(sent))
>>> tagged_sent
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('Foo', 'NNP'), ('Bar', 'NNP'), ('sentence', 'NN'), ('.', '.')]
>>> # Chunk named entities from the tagged tokens.
>>> ne_chunk(tagged_sent)
Tree('S', [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]), ('sentence', 'NN'), ('.', '.')])

To get the chunks look for subtrees within the chunked outputs. From the above output, the Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]) indicates the chunk.
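As a minimal sketch of looking for subtrees, the snippet below rebuilds the tree from the output above by hand (so no tagger or chunker models are needed) and collects every subtree below the root. The helper name `extract_chunks` is made up for illustration and is not part of NLTK.

```python
from nltk import Tree

# Rebuild the ne_chunk output shown above by hand, so no models are needed.
chunked = Tree('S', [
    ('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
    Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]),
    ('sentence', 'NN'), ('.', '.'),
])

def extract_chunks(tree):
    """Return (label, text) pairs for every subtree below the root.

    Hypothetical helper for illustration; not an NLTK function.
    """
    chunks = []
    # subtrees() also yields the root itself, so filter it out.
    for subtree in tree.subtrees(filter=lambda t: t is not tree):
        words = [word for word, tag in subtree.leaves()]
        chunks.append((subtree.label(), ' '.join(words)))
    return chunks

print(extract_chunks(chunked))
```

With a real sentence you would pass the result of ne_chunk(tagged_sent) instead of the hand-built tree.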

This tutorial is pretty helpful for explaining the chunking process in NLTK: http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/dongqing-chunking.pdf.

For official documentation, see http://www.nltk.org/howto/chunk.html

bfontaine
alvas
  • Current links above are defunct. Try: https://www.cs.umd.edu/~nau/cmsc421/part-of-speech-tagging.pdf – mccurcio Jul 28 '20 at 17:23
  • Try this: https://web.archive.org/web/20150412115803/http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/dongqing-chunking.pdf – alvas Jul 29 '20 at 08:52
26

The links above already cover most of these, but I hope this list is still helpful for someone. I have added a few tags that are missing from the other links.

CC: Coordinating conjunction

CD: Cardinal number

DT: Determiner

EX: Existential there

FW: Foreign word

IN: Preposition or subordinating conjunction

JJ: Adjective

VP: Verb Phrase

JJR: Adjective, comparative

JJS: Adjective, superlative

LS: List item marker

MD: Modal

NN: Noun, singular or mass

NNS: Noun, plural

PP: Prepositional Phrase

NNP: Proper noun, singular

NNPS: Proper noun, plural

PDT: Predeterminer

POS: Possessive ending

PRP: Personal pronoun

PRP$: Possessive pronoun

RB: Adverb

RBR: Adverb, comparative

RBS: Adverb, superlative

RP: Particle

S: Simple declarative clause

SBAR: Clause introduced by a (possibly empty) subordinating conjunction

SBARQ: Direct question introduced by a wh-word or a wh-phrase.

SINV: Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.

SQ: Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ.

SYM: Symbol

VBD: Verb, past tense

VBG: Verb, gerund or present participle

VBN: Verb, past participle

VBP: Verb, non-3rd person singular present

VBZ: Verb, 3rd person singular present

WDT: Wh-determiner

WP: Wh-pronoun

WP$: Possessive wh-pronoun

WRB: Wh-adverb
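If you want these meanings next to your tagged output, one option is a plain dictionary lookup. This is only a sketch: `TAG_MEANINGS` holds a handful of entries taken from the list above, and `describe` is a hypothetical helper, not something NLTK provides.

```python
# A few Penn Treebank tags and their meanings, taken from the list above.
TAG_MEANINGS = {
    'DT': 'Determiner',
    'IN': 'Preposition or subordinating conjunction',
    'NN': 'Noun, singular or mass',
    'NNS': 'Noun, plural',
    'RB': 'Adverb',
    'VBD': 'Verb, past tense',
    'VBZ': 'Verb, 3rd person singular present',
}

def describe(tagged_tokens):
    """Pair each (word, tag) with a readable description of the tag.

    Hypothetical helper; tags missing from TAG_MEANINGS map to 'unknown tag'.
    """
    return [(word, tag, TAG_MEANINGS.get(tag, 'unknown tag'))
            for word, tag in tagged_tokens]

tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]
for word, tag, meaning in describe(tagged):
    print(f'{word:10} {tag:5} {meaning}')
```

Extend the dictionary with whichever tags show up in your own output.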

red-devil
2

As Alvas said above, these are part-of-speech tags, which tell you whether a word is a noun, adverb, determiner, verb, etc.

Here are the POS tag details you can refer to.

Chunking recovers the phrases from the part-of-speech tags.

You can refer to this link to read more about chunking.
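To make the relationship between POS tags and chunks concrete, here is a small sketch using NLTK's RegexpParser on a sentence that is already tagged, so no tagger model needs to be downloaded. The grammar and the example sentence are my own illustrative choices, not taken from the links above.

```python
from nltk import RegexpParser

# Already-tagged input, so no tagger model has to be downloaded.
tagged = [('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'),
          ('jumped', 'VBD'), ('over', 'IN'),
          ('the', 'DT'), ('lazy', 'JJ'), ('dogs', 'NNS')]

# Illustrative grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then one or more nouns (NN, NNS, NNP, ...).
grammar = 'NP: {<DT>?<JJ>*<NN.*>+}'
chunker = RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the words of every NP subtree the grammar produced.
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(noun_phrases)
```

The grammar is written entirely in terms of the POS tags, which is why those tags show up in the chunker's output.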

Nishu Tayal
0

Since no one has mentioned it, you can also add the line nltk.help.upenn_tagset() to your code, which will print out all the POS tags and their meanings!

sniegs