Extracting text from Python RegexpParser

Question

I am new to NLTK.

This is the code I have used:

from nltk import RegexpParser, pos_tag, sent_tokenize

text = "The pizza was 66 and brilliant"
pattern = r"""
P: {<NN>+<VBD>+<CD>+}
"""
for sent in sent_tokenize(text):
  sentence = sent.split()
  PChunker = RegexpParser(pattern)
  output = PChunker.parse(pos_tag(sentence))
  print(output)

I am getting this output:

(S The/DT (P pizza/NN was/VBD 66/CD) and/CC brilliant/VB)

I need this output:

pizza was 66

How can I get this?

Looks like `output` is a sort of match object. Do the docs have anything about how to get the matched text from it? — Mad Physicist, May 25 '18 at 06:56

score 0 · Answer 1 · answered May 25 '18 at 07:52

The output of RegexpParser.parse is a tree (an nltk.Tree), not a match object, and you can walk it with Tree.subtrees(). Try the following, which filters directly for the non-terminal node you are interested in (P in your case):

from nltk import RegexpParser, pos_tag, sent_tokenize

text = "The pizza was 66 and brilliant"
pattern = r"""
P: {<NN>+<VBD>+<CD>+}
"""
# Build the chunker once; it does not depend on the sentence.
PChunker = RegexpParser(pattern)

for sent in sent_tokenize(text):
    sentence = sent.split()
    output = PChunker.parse(pos_tag(sentence))
    print(output)
    # Visit only the subtrees labelled 'P' (the chunks defined by the pattern).
    for subtree in output.subtrees(filter=lambda t: t.label() == 'P'):
        print(subtree)
        # Each leaf is a (word, tag) pair; join the words to get the chunk text.
        print(' '.join(word for word, tag in subtree.leaves()))
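If you need the chunk text in more than one place, the filtering step can be wrapped in a small helper. This is only a sketch: the name chunk_texts is made up for illustration, and the tagged tokens are hard-coded from the output shown in the question so that the snippet runs without the tokenizer and tagger models.

```python
from nltk import RegexpParser


def chunk_texts(tagged_tokens, grammar, label):
    """Return the text of every chunk in the parse that carries `label`."""
    parser = RegexpParser(grammar)
    tree = parser.parse(tagged_tokens)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]


# Tags copied from the output in the question (what pos_tag returned there).
tagged = [("The", "DT"), ("pizza", "NN"), ("was", "VBD"),
          ("66", "CD"), ("and", "CC"), ("brilliant", "VB")]

print(chunk_texts(tagged, "P: {<NN>+<VBD>+<CD>+}", "P"))
```

With the tags from the question this prints ['pizza was 66']. Returning a list rather than printing keeps the helper usable when a sentence contains several P chunks.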

Hello Igor, I have one more question. Is it possible to add 'was' directly in the pattern? I was trying for an example like this. — user9845038, May 25 '18 at 09:57
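On the follow-up: RegexpParser tag patterns match on POS tags, not on the words themselves, so a literal 'was' cannot be written into the pattern as-is. One possible workaround, sketched below, is to give that word its own tag before chunking. The WAS tag is invented for this example and is not part of any standard tagset, and the tagged tokens are again copied from the output in the question.

```python
from nltk import RegexpParser

# Tags copied from the output shown in the question.
tagged = [("The", "DT"), ("pizza", "NN"), ("was", "VBD"),
          ("66", "CD"), ("and", "CC"), ("brilliant", "VB")]

# Give the literal word its own (made-up) tag so the grammar can target it.
retagged = [(word, "WAS") if word.lower() == "was" else (word, tag)
            for word, tag in tagged]

# The pattern now requires exactly the word 'was' between the noun and the number.
parser = RegexpParser("P: {<NN>+<WAS><CD>+}")
tree = parser.parse(retagged)

chunks = [" ".join(word for word, _ in subtree.leaves())
          for subtree in tree.subtrees(filter=lambda t: t.label() == "P")]
print(chunks)
```

Here too the result is ['pizza was 66'], but another VBD verb such as 'cost' in the same position would no longer be chunked.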
