Recursion in nltk's RegexpParser

Question

Based on the grammar in chapter 7 of the NLTK Book:

grammar = r"""
      NP: {<DT|JJ|NN.*>+} # ...
"""

I want to expand NP (noun phrase) to include multiple NPs joined by CC (coordinating conjunctions, e.g. and) or , (commas), to capture noun phrases like:

The house and tree
The apple, orange and mango
Car, house, and plane

I cannot get my modified grammar to capture those as a single NP:

import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+(<CC|,>+<NP>)?}
"""

sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

Results in:

(S (NP The/DT house/NN) and/CC (NP tree/NN))

I've tried moving the NP to the beginning: NP: {(<NP><CC|,>+)?<DT|JJ|NN.*>+}, but I get the same result:

(S (NP The/DT house/NN) and/CC (NP tree/NN))

alvas · Accepted Answer · 2019-04-23T07:12:53.633

Let's start small and capture NPs (noun phrases) properly:

import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
"""

sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S (NP The/DT house/NN) and/CC (NP tree/NN))

Now let's try to catch that and/CC. Simply add a higher-level phrase that reuses the <NP> rule:

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  CNP: {<NP><CC><NP>}
"""

sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S (CNP (NP The/DT house/NN) and/CC (NP tree/NN)))

Now that we catch NP CC NP phrases, let's get a little fancy and see whether it catches commas too:

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  CNP: {<NP><CC|,><NP>}
"""

sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

Now we see that it only catches the first NP CC|, NP sequence from the left and leaves the last NP alone.
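The limitation can be reproduced without downloading the tokenizer or tagger models. The sketch below hand-tags the tokens (an assumption, using the same tags that appear in the outputs elsewhere in this answer) and lists the top-level nodes the parser produces; the name top_level is only for illustration.

```python
import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  CNP: {<NP><CC|,><NP>}
"""

# Hand-tagged tokens (an assumption standing in for nltk.pos_tag output).
tagged = [('The', 'DT'), ('house', 'NN'), (',', ','),
          ('the', 'DT'), ('bear', 'NN'), ('and', 'CC'), ('tree', 'NN')]

tree = nltk.RegexpParser(grammar).parse(tagged)

# Label of each top-level node; unchunked (token, tag) leaves are kept as-is.
top_level = [node.label() if isinstance(node, nltk.Tree) else node
             for node in tree]
print(top_level)  # ['CNP', ('and', 'CC'), 'NP']
```

The CNP rule is applied in a single left-to-right pass, so once the first NP , NP pair has been grouped, and/CC and the final NP no longer have an NP on their left to attach to.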

In English, every conjunct after the first is introduced by a conjunction on its left and closed by an NP on its right, i.e. CC|, NP (e.g. and the tree). Since the CC|, NP pattern repeats, we can use it as an intermediate representation:

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  XNP: {<CC|,><NP>}
  CNP: {<NP><XNP>+}
"""

sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP and/CC (NP tree/NN))))

Ultimately, the CNP (conjunctive NP) grammar captures chained noun phrase conjunctions in English, even complicated ones, e.g.

import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  XNP: {<CC|,><NP>}
  CNP: {<NP><XNP>+}
"""

sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP ,/, (NP the/DT green/JJ house/NN))
    (XNP and/CC (NP a/DT tree/JJ)))
  went/VBD
  to/TO
  (CNP (NP the/DT park/NN) (XNP or/CC (NP the/DT river/NN)))
  ./.)
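For something closer to the recursion asked about in the question, RegexpParser also takes a loop argument that re-runs every stage over the partially chunked tree. The following is only a sketch under assumptions: the tokens are hand-tagged, and the CNP rule is a hypothetical variant that accepts either an NP or an already-built CNP on its left.

```python
import nltk

# Hypothetical variant: a CNP may start with an NP or a previously built CNP.
grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  CNP: {<NP|CNP><CC|,><NP>}
"""

# Hand-tagged tokens (an assumption standing in for nltk.pos_tag output).
tagged = [('The', 'DT'), ('house', 'NN'), (',', ','),
          ('the', 'DT'), ('bear', 'NN'), ('and', 'CC'), ('tree', 'NN')]

# loop=2 runs the NP stage and the CNP stage twice over the growing tree.
tree = nltk.RegexpParser(grammar, loop=2).parse(tagged)
print(tree)
```

Each pass adds at most one level of nesting, so this variant depends on the loop count and may not fully merge longer lists, whereas the <NP><XNP>+ rule handles any number of conjuncts in a single pass.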

And if you're just interested in extracting the noun phrases, here is a traversal adapted from "How to Traverse an NLTK Tree object?":

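noun_phrases = []

def traverse_tree(tree):
    # Depth-first walk that records the text of every CNP subtree.
    if tree.label() == 'CNP':
        noun_phrases.append(' '.join(token for token, tag in tree.leaves()))
    for subtree in tree:
        # Leaves are (token, tag) tuples; only recurse into real subtrees.
        if isinstance(subtree, nltk.tree.Tree):
            traverse_tree(subtree)

    # Note: noun_phrases is module-level, so it accumulates across calls.
    return noun_phrases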

sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
traverse_tree(chunkParser.parse(tagged))

[out]:

['The house , the bear , the green house and a tree', 'the park or the river']
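The same extraction can be sketched without a module-level list by using Tree.subtrees with a filter function. The tokens below are hand-tagged (an assumption, so the example runs without the tokenizer and tagger models), and the name phrases is only for illustration.

```python
import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  XNP: {<CC|,><NP>}
  CNP: {<NP><XNP>+}
"""

# Hand-tagged tokens (an assumption standing in for nltk.pos_tag output).
tagged = [('The', 'DT'), ('house', 'NN'), (',', ','),
          ('the', 'DT'), ('bear', 'NN'), ('and', 'CC'), ('tree', 'NN')]

tree = nltk.RegexpParser(grammar).parse(tagged)

# Tree.subtrees yields every subtree for which the filter returns True.
phrases = [' '.join(token for token, tag in subtree.leaves())
           for subtree in tree.subtrees(filter=lambda t: t.label() == 'CNP')]
print(phrases)  # ['The house , the bear and tree']
```

Because the result is built fresh on each run, repeated calls do not accumulate phrases from earlier sentences.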

Also, see Python (NLTK) - more efficient way to extract noun phrases?

Thanks for the detailed step-by-step response. Does the `X` in `XNP` mean something or follow a standard? Thanks for including lingo like "Conjunctive NPs"; it lets me search better for what I want and learn more about this topic. Like I said, I'm interested in a recursive answer (to better understand how recursion works in a grammar), but if no better answer comes in time I'll accept your answer. — Leonel Galán, Apr 23 '19 at 15:49
X is a convention. In government and binding theory, and in Chomskyan grammar more generally, it's common to represent intermediate branches as X*. Sometimes an apostrophe is used instead of X, e.g. `NP'` and `NP''`, to represent intermediate structures. — alvas, Apr 24 '19 at 01:50
In a recursive grammar, you have "non-terminals" and "terminals". In this case you can treat the POS-based NPs as "terminals" and the above XNP as "non-terminals", and the "NP with conjunction" is the top-level non-terminal that you want to look for in noun phrase chunking. (BTW, I made up the term "Conjunctive NP"; I'm not sure whether it's a proper grammar term.) — alvas, Apr 24 '19 at 01:55
