I'm very new to NLP and am currently trying my luck with Python's NLTK. One of the more confusing things about NLTK is grammar construction: in the examples provided in the NLTK book, the grammar is written specifically for each sentence under analysis.
import nltk

grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)

which prints

(S (NP Mary) (VP (V saw) (NP Bob)))
I want to analyse a huge range of newspaper articles, and obviously writing a dedicated grammar for every sentence is not feasible. Specifically, I need to know the number of clauses per sentence. Is there an already existing grammar for such a task, and if not, how would one go about writing one?
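One approach I'm considering, assuming I can get full Penn Treebank-style parses out of an off-the-shelf statistical parser (e.g. the Stanford parser), is to simply count clause-level labels in each tree. A minimal sketch; the tree string here is hand-written purely for illustration:

import nltk

# Clause-level node labels in the Penn Treebank tag set
CLAUSE_LABELS = {'S', 'SBAR', 'SBARQ', 'SINV', 'SQ'}

def count_clauses(tree):
    """Count clause-level nodes in a Penn-style constituency tree."""
    return sum(1 for t in tree.subtrees(lambda t: t.label() in CLAUSE_LABELS))

# Hand-written example tree standing in for real parser output
parsed = nltk.Tree.fromstring(
    "(S (NP (PRP She)) (VP (VBD said) (SBAR (IN that) "
    "(S (NP (PRP it)) (VP (VBD worked))))))")
print(count_clauses(parsed))  # 3: the root S, the SBAR, and the embedded S

Would that be a sensible way to count clauses, or is there a more standard method?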
All my sentences are already tokenized and POS tagged, e.g. (a shallow fallback using just these tags is sketched after the list):
[(u'Her', 'PRP$'),
(u'first', 'JJ'),
(u'term', 'NN'),
(u'followed', 'VBN'),
(u'a', 'DT'),
(u'string', 'NN'),
(u'of', 'IN'),
(u'high', 'JJ'),
(u'profile', 'NN'),
(u'police', 'NNS'),
(u'abuse', 'VBP'),
(u'cases', 'NNS'),
(u'including', 'VBG'),
(u'the', 'DT'),
(u'choking', 'NN'),
(u'death', 'NN'),
(u'of', 'IN'),
(u'a', 'DT'),
(u'Hispanic', 'NNP'),
(u'man', 'NN'),
(u'in', 'IN'),
(u'1994', 'CD'),
(u'the', 'DT'),
(u'Louima', 'NNP'),
(u'case', 'NN'),
(u'in', 'IN'),
(u'1997', 'CD'),
(u'and', 'CC'),
(u'the', 'DT'),
(u'shooting', 'NN'),
(u'deaths', 'NNS'),
(u'of', 'IN'),
(u'a', 'DT'),
(u'West', 'NNP'),
(u'African', 'NNP'),
(u'immigrant', 'NN'),
(u'in', 'IN'),
(u'1999', 'CD'),
(u'and', 'CC'),
(u'a', 'DT'),
(u'black', 'JJ'),
(u'security', 'NN'),
(u'guard', 'NN'),
(u'in', 'IN'),
(u'early', 'JJ'),
(u'2000', 'CD')]
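If full parsing turns out to be too expensive, one shallow fallback I've thought of is to chunk finite verb groups with nltk.RegexpParser and take their count as a rough proxy for the number of clauses. This is only a sketch; the tag pattern is an illustrative guess, not a vetted grammar:

import nltk

# Treat each finite verb group as the head of one clause
# (a rough approximation, not a vetted grammar).
chunk_grammar = r"""
  CLAUSE: {<MD>?<VBD|VBP|VBZ|VBN|VBG><VB|VBN|VBG>*}
"""
chunker = nltk.RegexpParser(chunk_grammar)

# First few tokens of the tagged sentence above
tagged = [(u'Her', 'PRP$'), (u'first', 'JJ'), (u'term', 'NN'),
          (u'followed', 'VBN'), (u'a', 'DT'), (u'string', 'NN')]

tree = chunker.parse(tagged)
print(sum(1 for st in tree.subtrees() if st.label() == 'CLAUSE'))  # 1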