I'm very new to NLP and am currently trying my luck with Python's NLTK. One of the more confusing things about NLTK is grammar construction: in the examples provided in the NLTK book, the grammar is written specifically for each sentence under analysis.
import nltk

grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)

which prints

(S (NP Mary) (VP (V saw) (NP Bob)))
I want to analyse a huge range of newspaper articles, and obviously writing a dedicated grammar for every sentence is not feasible. Specifically, I need to know the number of clauses per sentence. Is there an already existing grammar for such a task, and if not, how would one go about writing one?
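One approach I'm considering, assuming I can get full Penn Treebank-style parses out of an off-the-shelf statistical parser (e.g. the Stanford parser), is to simply count clause-level labels in each tree. A minimal sketch; the tree string here is hand-written purely for illustration:

import nltk

# Clause-level node labels in the Penn Treebank tag set
CLAUSE_LABELS = {'S', 'SBAR', 'SBARQ', 'SINV', 'SQ'}

def count_clauses(tree):
    """Count clause-level nodes in a Penn-style constituency tree."""
    return sum(1 for t in tree.subtrees(lambda t: t.label() in CLAUSE_LABELS))

# Hand-written example tree standing in for real parser output
parsed = nltk.Tree.fromstring(
    "(S (NP (PRP She)) (VP (VBD said) (SBAR (IN that) "
    "(S (NP (PRP it)) (VP (VBD worked))))))")
print(count_clauses(parsed))  # 3: the root S, the SBAR, and the embedded S

Would that be a sensible way to count clauses, or is there a more standard method?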
All my sentences are already tokenized and POS tagged, e.g. (a shallow fallback using just these tags is sketched after the list):
[(u'Her', 'PRP$'),
(u'first', 'JJ'),
(u'term', 'NN'),
(u'followed', 'VBN'),
(u'a', 'DT'),
(u'string', 'NN'),
(u'of', 'IN'),
(u'high', 'JJ'),
(u'profile', 'NN'),
(u'police', 'NNS'),
(u'abuse', 'VBP'),
(u'cases', 'NNS'),
(u'including', 'VBG'),
(u'the', 'DT'),
(u'choking', 'NN'),
(u'death', 'NN'),
(u'of', 'IN'),
(u'a', 'DT'),
(u'Hispanic', 'NNP'),
(u'man', 'NN'),
(u'in', 'IN'),
(u'1994', 'CD'),
(u'the', 'DT'),
(u'Louima', 'NNP'),
(u'case', 'NN'),
(u'in', 'IN'),
(u'1997', 'CD'),
(u'and', 'CC'),
(u'the', 'DT'),
(u'shooting', 'NN'),
(u'deaths', 'NNS'),
(u'of', 'IN'),
(u'a', 'DT'),
(u'West', 'NNP'),
(u'African', 'NNP'),
(u'immigrant', 'NN'),
(u'in', 'IN'),
(u'1999', 'CD'),
(u'and', 'CC'),
(u'a', 'DT'),
(u'black', 'JJ'),
(u'security', 'NN'),
(u'guard', 'NN'),
(u'in', 'IN'),
(u'early', 'JJ'),
(u'2000', 'CD')]
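If full parsing turns out to be too expensive, one shallow fallback I've thought of is to chunk finite verb groups with nltk.RegexpParser and take their count as a rough proxy for the number of clauses. This is only a sketch; the tag pattern is an illustrative guess, not a vetted grammar:

import nltk

# Treat each finite verb group as the head of one clause
# (a rough approximation, not a vetted grammar).
chunk_grammar = r"""
  CLAUSE: {<MD>?<VBD|VBP|VBZ|VBN|VBG><VB|VBN|VBG>*}
"""
chunker = nltk.RegexpParser(chunk_grammar)

# First few tokens of the tagged sentence above
tagged = [(u'Her', 'PRP$'), (u'first', 'JJ'), (u'term', 'NN'),
          (u'followed', 'VBN'), (u'a', 'DT'), (u'string', 'NN')]

tree = chunker.parse(tagged)
print(sum(1 for st in tree.subtrees() if st.label() == 'CLAUSE'))  # 1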