
I am trying to split a chunk at the position of a colon (:) in NLTK, but it seems to be a special case. In a normal regex I can just put it in [:] with no problems.

But in NLTK, no matter what I do, the RegexpParser does not accept it.

from nltk import  RegexpParser

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>|<NNP.*><\:><VBD>}   # chunk (Rapunzel + : + let) together
    {<NNP>+}                
    <.*>}{<VBD.*>           


"""
cp = RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), (":",":"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

print(cp.parse(sentence))

The above code does make a chunk that picks up the colon as a block. The <.*>}{<VBD.*> line splits the chunk made up of (Rapunzel + : + let) at the position before let. If you take out that split and replace it with one on the colon, it gives an error:

from nltk import  RegexpParser

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>|<NNP.*><\:><VBD>}   # chunk (Rapunzel + : + let) together
    {<NNP>+}                
    <.*>}{<\:.*>           


"""
cp = RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), (":",":"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

print(cp.parse(sentence))

ValueError: Illegal chunk pattern: >

Can anyone explain how to do this? I tried Google and went through the docs, but I am none the wiser. I can deal with this after chunking, no problem, but I just have to know why or how. :-)

Kaspar Lee
yaroze
  • Good question! To allow people to help you, please give a short (but complete) code sample showing a trivial example of how you use the RegexpParser and get the error. – alexis Oct 15 '16 at 13:20

1 Answer


It seems that NLTK treats a second colon within a chunk definition as an indicator to start a new chunk definition.

For those who get the same error, a workaround is to break a definition with multiple regexes into multiple definitions with the same name, one regex per line.

Let's assume we have the following grammar:

grammar = r"""
  SOME_CHUNK: 
    {<NN><:>}
    {<JJ><:>}          
"""

To fix this, change it to:

grammar = r"""
  SOME_CHUNK: {<NN><:>}
  SOME_CHUNK: {<JJ><:>}          
"""
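
As a quick check, here is a minimal sketch that parses a sentence with the one-rule-per-line grammar. The tagged tokens are hypothetical and made up for illustration; only RegexpParser and Tree from NLTK are used.

```python
from nltk import RegexpParser, Tree

# One rule per line, each line repeating the chunk name, so the
# "NAME:" prefix always comes before any <:> tag pattern on that line.
grammar = r"""
  SOME_CHUNK: {<NN><:>}
  SOME_CHUNK: {<JJ><:>}
"""

# Hypothetical tagged tokens, invented for this illustration.
sentence = [("hair", "NN"), (":", ":"), ("golden", "JJ"), (":", ":"), ("down", "RP")]

tree = RegexpParser(grammar).parse(sentence)

# Collect the leaves of every chunk (subtree) directly under the root.
chunks = [subtree.leaves() for subtree in tree if isinstance(subtree, Tree)]

print(tree)
print(chunks)
```

Both rules should end up producing chunks labelled SOME_CHUNK, each ending in the colon token.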

Unfortunately, this won't work if a chinking or split rule contains another colon, as in your example.

To help you solve your specific issue, please post the exact sentence you are trying to parse. From your example, it is hard to tell why you need the |<NNP.*><\:><VBD> part at all.
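
One thing that might be worth trying for a split rule on the colon is to avoid writing a literal colon on that line at all. The sketch below reuses the grammar and sentence from the question and replaces the colon in the split rule with the hex escape \x3a. It assumes the tag pattern is eventually handed to Python's re module, where \x3a matches ':'. That is an assumption about NLTK's internals rather than documented behaviour, so it may not hold in every version.

```python
from nltk import RegexpParser, Tree

# Same grammar as in the question, except that the split rule spells the
# colon tag as \x3a (assumed to reach Python's re module unchanged).
# Note that the split-rule line contains no literal ':' character at all,
# not even in a trailing comment.
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>|<NNP.*><\:><VBD>}   # chunk (Rapunzel + : + let) together
    {<NNP>+}
    <.*>}{<\x3a>
"""

sentence = [("Rapunzel", "NNP"), (":", ":"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

tree = RegexpParser(grammar).parse(sentence)

# Leaves of each NP chunk, in sentence order.
np_chunks = [subtree.leaves() for subtree in tree if isinstance(subtree, Tree)]

print(tree)
```

If the assumption holds, the chunk (Rapunzel + : + let) is split just before the colon, leaving Rapunzel in a chunk of its own.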

sorjef