
I have a corpus of English sentences

sentences = [
    "Mary had a little lamb.",
    "John has a cute black pup.",
    "I ate five apples."
]

and a grammar (for the sake of simplicity)

grammar = ('''
    NP: {<NNP><VBZ|VBD><DT><JJ>*<NN><.>} # NP
    ''')

I wish to filter out the sentences that don't conform to the grammar. Is there a built-in NLTK function that can do this? In the example above, the first two sentences follow the pattern of my grammar, but the last one does not.

rocx

2 Answers


TL;DR

Write a grammar, parse each POS-tagged sentence with it, then iterate through the subtrees of the result and check for the non-terminal you want, e.g. NP.

Code:

import nltk
from nltk import pos_tag, word_tokenize

grammar = ('''
    NP: {<NNP><VBZ|VBD><DT><JJ>*<NN><.>} # NP
    ''')

sentences = [
    "Mary had a little lamb.",
    "John has a cute black pup.",
    "I ate five apples."
]

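# Build the chunk parser once, before the function that relies on it.
chunkParser = nltk.RegexpParser(grammar)

def has_noun_phrase(sentence):
    # Tokenize, POS-tag, then chunk the sentence with the grammar.
    # Uses nltk.pos_tag and nltk.word_tokenize, which need the NLTK
    # tokenizer and tagger data (see nltk.download).
    parsed = chunkParser.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
    # A successful match appears as a subtree labelled 'NP'.
    for subtree in parsed:
        if isinstance(subtree, nltk.Tree) and subtree.label() == 'NP':
            return True
    return False

for sentence in sentences:
    print(has_noun_phrase(sentence))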
alvas

NLTK supports POS tagging. You can first apply POS tagging to your sentences and then compare the resulting tag sequence with your predefined grammar. Below is an example of using NLTK POS tagging.

[Image: example of NLTK POS tagging output]
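
As a rough sketch of the tagging step described above: `tag_sentence` wraps the real `nltk.word_tokenize` and `nltk.pos_tag` calls, while `tag_sequence` is a hypothetical helper (not part of NLTK) that keeps only the tags so they can be compared with a pattern. The tagged example is hand-written and merely stands in for tagger output.

```python
def tag_sentence(sentence):
    """Tokenize and POS-tag one sentence with NLTK."""
    # Imported here so the helper below stays usable without NLTK installed.
    # Needs the NLTK tokenizer and tagger data (fetched via nltk.download).
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(sentence))


def tag_sequence(tagged):
    """Hypothetical helper: keep only the tags from (word, tag) pairs."""
    return [tag for _word, tag in tagged]


# Hand-written tags standing in for tagger output (an assumption, not a real run).
example = [("I", "PRP"), ("ate", "VBD"), ("five", "CD"), ("apples", "NNS"), (".", ".")]
tags = tag_sequence(example)
print(tags)
```

With the NLTK data installed, `tag_sequence(tag_sentence(sentence))` should yield the tag list for any sentence in the corpus, which is what the grammar is ultimately matched against.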

Giang Nguyen
  • But that doesn't solve my problem. My grammar has a pre-defined structure and I wish to validate whether the grammar returned by nltk.pos_tag() is _similar_ or not. I could write my own parser to validate my regex grammar against the one returned but I'm looking for an inbuilt validator. – rocx May 10 '19 at 05:56
  • I don't know; maybe you need to do it on your own. Sorry. – Giang Nguyen May 10 '19 at 06:32
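
For the hand-rolled route mentioned in the comments, one possible sketch is to join the POS tags into a string and require a full match against an ordinary regular expression. This is a swapped-in technique using Python's `re` module, not an NLTK built-in; `PATTERN` and `conforms` are hypothetical names, and the tags below are hand-written on the assumption that a tagger would produce them.

```python
import re

# Hypothetical pattern mirroring the question's grammar, written as a plain
# regular expression over a space-separated tag string (not NLTK chunk syntax).
PATTERN = re.compile(r"NNP (VBZ|VBD) DT (JJ )*NN \.")


def conforms(tagged_sentence):
    """Return True only if the entire tag sequence matches PATTERN."""
    tags = " ".join(tag for _word, tag in tagged_sentence)
    return PATTERN.fullmatch(tags) is not None


# Hand-written (word, tag) pairs; assumed, not produced by a real tagger run.
tagged = [
    [("Mary", "NNP"), ("had", "VBD"), ("a", "DT"), ("little", "JJ"),
     ("lamb", "NN"), (".", ".")],
    [("John", "NNP"), ("has", "VBZ"), ("a", "DT"), ("cute", "JJ"),
     ("black", "JJ"), ("pup", "NN"), (".", ".")],
    [("I", "PRP"), ("ate", "VBD"), ("five", "CD"), ("apples", "NNS"),
     (".", ".")],
]

results = [conforms(sentence) for sentence in tagged]
print(results)
```

Because `fullmatch` must consume the whole tag string, a sentence with extra words before or after the pattern is rejected, which is stricter than checking that a matching chunk exists somewhere in the sentence.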