
I am trying to chunk a sentence using NLTK's POS tags as regular expressions. Two rules are defined to identify phrases, based on the tags of the words in the sentence.

Mainly, I want to capture a chunk of one or more verbs followed by an optional determiner and then one or more nouns at the end. This is the first rule in the grammar, but it is not being captured as a phrase chunk.

import nltk

## Defining the POS tagger 
tagger = nltk.data.load(nltk.tag._POS_TAGGER)


## A Single sentence - input text value
textv="This has allowed the device to start, and I then see glitches which is not nice."
tagged_text = tagger.tag(textv.split())

## Defining Grammar rules for  Phrases
actphgrammar = r"""
     Ph: {<VB*>+<DT>?<NN*>+}  # verbal phrase - one or more verbs followed by optional determiner, and one or more nouns at the end
     {<RB*><VB*|JJ*|NN*\$>} # Adverbial phrase - Adverb followed by adjective / Noun or Verb
     """

### Parsing the defined grammar for  phrases
actp = nltk.RegexpParser(actphgrammar)

actphrases = actp.parse(tagged_text)

The input to the chunker, tagged_text, is shown below.

tagged_text Out[7]: [('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN'), ('to', 'TO'), ('start,', 'NNP'), ('and', 'CC'), ('I', 'PRP'), ('then', 'RB'), ('see', 'VB'), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice.', 'NNP')]

In the final output, only the adverbial phrase ('then see'), which matches the second rule, is captured. I expected the verbal phrase ('allowed the device') to match the first rule and be captured as well, but it is not.

actphrases Out[8]: Tree('S', [('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN'), ('to', 'TO'), ('start,', 'NNP'), ('and', 'CC'), ('I', 'PRP'), Tree('Ph', [('then', 'RB'), ('see', 'VB')]), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice.', 'NNP')])

The NLTK version used is 2.0.5 (Python 2.7). Any help or suggestion would be greatly appreciated.

Thanks in advance,

Bala.

  • First update your NLTK to 3.1. There have been significant changes since 2.0, and the update is necessary to get working code: `sudo apt-get install python-nltk` or `sudo pip install -U nltk`. Then take a look at http://stackoverflow.com/questions/34090734/how-to-use-nltk-regex-pattern-to-extract-a-specific-phrase-chunk/34093919#34093919 – alvas Dec 18 '15 at 17:53

1 Answer


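You're close; minor changes to your regex will get you the desired output. Inside a RegexpParser tag pattern, `*` is an ordinary regex quantifier rather than a wildcard, so VB* means "V followed by zero or more B": it matches VB but not VBN or VBZ. When you want a wildcard, use .* instead of *, e.g. VB.* instead of VB*: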

>>> from nltk import word_tokenize, pos_tag, RegexpParser
>>> text = "This has allowed the device to start, and I then see glitches which is not nice."
>>> tagged_text = pos_tag(word_tokenize(text))    
>>> g = r"""
... VP: {<VB.*><DT><NN.*>}
... """
>>> p = RegexpParser(g); p.parse(tagged_text)
Tree('S', [('This', 'DT'), ('has', 'VBZ'), Tree('VP', [('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN')]), ('to', 'TO'), ('start', 'VB'), (',', ','), ('and', 'CC'), ('I', 'PRP'), ('then', 'RB'), ('see', 'VBP'), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice', 'JJ'), ('.', '.')])
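Why `VB*` misses `VBN` can be checked without NLTK at all. The following is only a sketch using Python 3's standard `re` module as a stand-in for how a tag pattern is compared against a single tag; `matching_tags` is a hypothetical helper written for this illustration, not part of NLTK.

```python
import re

# Tags taken from the tagged sentence in the question.
tags = ["VB", "VBN", "VBZ", "NN", "NNS"]


def matching_tags(pattern, tags):
    """Return the tags that the pattern matches in full (hypothetical helper)."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]


# `B*` means "zero or more B", so only the bare tag VB matches.
print(matching_tags(r"VB*", tags))   # ['VB']
# `.*` means "any characters", so the inflected verb tags match as well.
print(matching_tags(r"VB.*", tags))  # ['VB', 'VBN', 'VBZ']
# The same applies to the noun tags.
print(matching_tags(r"NN*", tags))   # ['NN']
print(matching_tags(r"NN.*", tags))  # ['NN', 'NNS']
```

This also explains why the second rule in the question happened to fire: the tags involved were the bare `RB` and `VB`, which `RB*` and `VB*` do match.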

Note that your second rule does capture the adverbial phrase [('then', 'RB'), ('see', 'VB')], but only because the tags are exactly RB and VB. So the `*` in your grammar (i.e. `{<RB*><VB*|JJ*|NN*\$>}`) has no effect in this scenario.

Also, since these are two different types of phrase, it's advisable to use two labels rather than one. And (I think) the end-of-string anchor (`\$`) after the wildcard is redundant, so it's better to write:

g = r"""
VP: {<VB.*><DT><NN.*>}
AdvP: {<RB.*><VB.*|JJ.*|NN.*>}
"""
alvas
  • Great, thanks @alvas for the help. I have removed the redundant end-of-string anchor as well, and it's working now. Regarding the phrase labels, I intended to keep them the same, as I would like to filter both phrase types into a single list based on `subtree.label` later. If I mention more than one label within the list, like `if subtree.node in ("VP","AdvP")`, the list is unhashable. That's why I preferred to keep the same label for both types of phrase. Will that cause any issue in parsing? – Bala Dec 19 '15 at 03:54
  • The rules might be hierarchical if they have the same label. I'm not sure what it will do when parsing your data, but it should work, I guess. – alvas Dec 19 '15 at 06:58
  • Is there a definitive guide to chunking with the NLTK regex chunker? I've had mediocre success on my data when changing the above to "{*}". Is the NLTK chapter (http://www.nltk.org/howto/chunk.html) really the best source? – RandomTask May 13 '18 at 21:41
  • I am also looking for a comprehensive resource for NLTK regex chunker, but the best I found so far is https://www.guru99.com/pos-tagging-chunking-nltk.html. I am still looking for a more comprehensive resource. – jiggysoo Aug 12 '20 at 08:51