
I'm working on a parser for a non-English language whose text uses Unicode characters, and I decided to use NLTK for it.

But it requires a predefined context-free grammar as below:

  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with" 

In my app, I need to minimize hard-coding by using a rule-based grammar. For example, I can assume that any word ending in -ed or -ing is a verb, so the grammar should work for any given context.

How can I feed such grammar rules to NLTK? Or generate them dynamically using a finite-state machine?

ChamingaD
  • You may like to read [this answer](http://stackoverflow.com/questions/14096237/can-someone-give-a-simple-but-non-toy-example-of-a-context-sensitive-grammar/14099421#14099421) because you are writing CFG. – Grijesh Chauhan Jul 17 '13 at 18:39
  • Thanks. I looked but couldn't understand it. Is there any way i can feed python variables to CFG ? – ChamingaD Jul 18 '13 at 04:09
  • If you want to automatically learn CFG rules, you can try implementing this www.aclweb.org/anthology/O06-1004 =) – alvas Jul 24 '13 at 17:56

5 Answers


If you are creating a parser, then you have to add a step of pos-tagging before the actual parsing -- there is no way to successfully determine the POS-tag of a word out of context. For example, 'closed' can be an adjective or a verb; a POS-tagger will find out the correct tag for you from the context of the word. Then you can use the output of the POS-tagger to create your CFG.

You can use one of the many existing POS-taggers. In NLTK, you can simply do something like:

import nltk

input_sentence = "Dogs chase cats"
# Tokenize the sentence, then tag each token with its part of speech
text = nltk.word_tokenize(input_sentence)
list_of_tokens = nltk.pos_tag(text)
print(list_of_tokens)

The output will be:

[('Dogs', 'NN'), ('chase', 'VB'), ('cats', 'NN')]

which you can use to create a grammar string and feed to nltk.CFG.fromstring() (nltk.parse_cfg() in older NLTK versions).
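
For example, here is a rough sketch of that last step. The mapping from Penn Treebank tags to the toy grammar's nonterminals, the fixed structural productions, and the lower-casing are all illustrative assumptions, and the exact tags you get back will depend on the tagger:

    import nltk

    def grammar_from_tags(tagged_tokens):
        """Build a CFG string whose lexical rules come from a POS-tagged sentence."""
        # Illustrative mapping from Penn Treebank tags to the toy grammar's nonterminals
        tag_to_nonterminal = {'NN': 'N', 'NNS': 'N', 'VB': 'V', 'VBP': 'V',
                              'VBZ': 'V', 'DT': 'Det', 'IN': 'P'}
        productions = ["S -> NP VP", "VP -> V NP | V NP PP",
                       "PP -> P NP", "NP -> N | Det N | Det N PP"]
        for word, tag in tagged_tokens:
            nonterminal = tag_to_nonterminal.get(tag)
            if nonterminal:
                productions.append('%s -> "%s"' % (nonterminal, word.lower()))
        return "\n".join(productions)

    tagged = nltk.pos_tag(nltk.word_tokenize("Dogs chase cats"))
    grammar = nltk.CFG.fromstring(grammar_from_tags(tagged))
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse([word.lower() for word, tag in tagged]):
        print(tree)

Only the lexical rules are generated from the sentence; the structural rules stay fixed, which keeps the hard-coding down to the tag-to-nonterminal mapping.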

dkar
  • No, but NLTK allows you to train your own tagger in a very straightforward way. However, in order to do that you're going to need some tagged corpus of your language for training the statistical model. Do you have access to such a resource? What is the language you're working on? – dkar Jul 24 '13 at 16:29
  • I need rule based grammar generation method. For example words ending with -ed or -ing as a verb (in my app i will use unicode character). Is there anyway to do that with NLTK ? – ChamingaD Jul 25 '13 at 05:45
  • I guess that this means you don't have any tagged corpora in your language? Anyway, if you want a fully rule-based tagger you have to create it yourself (write rules yourself, such as "if the word begins with this and ends with that and the previous word is this, then the word is an adjective"). I don't think that NLTK has a mechanism for that. However, it is still not clear to me what you want exactly to do and why you have to use explicitly rule-based systems. You are welcome of course to provide us a more complete description of your requirements. – dkar Jul 25 '13 at 21:41
  • Appreciate your help. Yes i won't have tagged corpus. But i need some way to identify POS using rules, with minimum hard coding. – ChamingaD Jul 26 '13 at 03:46

Maybe you're looking for CFG.fromstring() (formerly parse_cfg())?

From Chapter 7 of the NLTK book (updated to NLTK 3.0):

>>> grammar = nltk.CFG.fromstring("""
...   S -> NP VP
...   VP -> V NP | V NP PP
...   V -> "saw" | "ate"
...   NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
...   Det -> "a" | "an" | "the" | "my"
...   N -> "dog" | "cat" | "cookie" | "park"
...   PP -> P NP
...   P -> "in" | "on" | "by" | "with"
...   """)
>>> sent = "Mary saw Bob".split()
>>> rd_parser = nltk.RecursiveDescentParser(grammar)
>>> for p in rd_parser.parse(sent):
...     print(p)
...
(S (NP Mary) (VP (V saw) (NP Bob)))
arturomp
  • Thanks. But it still hard-codes those verbs and nouns, right? Is there any way to pass a string value to the CFG? like V = variable_a – ChamingaD Jul 20 '13 at 04:58
  • I'm sure you could concatenate the strings and then pass them in! http://stackoverflow.com/questions/12169839/ – arturomp Jul 21 '13 at 14:54
  • actually, from what I understand in your original question, another thing to try (not entirely sure if it is possible) is to do partial POS tagging only on the words ending in -ing or -ed, and mark them as V, so you don't have to worry about the V rule in your CFG. – arturomp Jul 22 '13 at 15:05

You can use NLTK's RegexpTagger, which assigns a tag to each token based on regular expressions. This is exactly what you need in your case: tokens ending in 'ing' will be tagged as gerunds and tokens ending in 'ed' will be tagged as past-tense verbs. See the example below.

patterns = [
    (r'.*ing$', 'VBG'), # gerunds
    (r'.*ed$', 'VBD'), # simple past
    (r'.*es$', 'VBZ'), # 3rd singular present
    (r'.*ould$', 'MD'), # modals
    (r'.*\'s$', 'NN$'), # possessive nouns
    (r'.*s$', 'NNS') # plural nouns
 ]

Note that these are processed in order, and the first one that matches is applied. Now we can set up a tagger and use it to tag a sentence. After this step, it is correct about a fifth of the time.

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(your_sent)

You can also combine taggers so that several taggers are tried in sequence, each one falling back to the next when it cannot tag a token.
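
For instance, a minimal sketch of such a backoff chain (the patterns and the default tag are just illustrative) could look like this:

    import nltk

    patterns = [
        (r'.*ing$', 'VBG'),   # gerunds
        (r'.*ed$', 'VBD'),    # simple past
        (r'.*s$', 'NNS'),     # plural nouns
    ]

    # Tokens that match no pattern are handed to the backoff tagger,
    # which here simply labels everything as a noun.
    default_tagger = nltk.DefaultTagger('NN')
    regexp_tagger = nltk.RegexpTagger(patterns, backoff=default_tagger)

    print(regexp_tagger.tag("the dogs walked home".split()))
    # [('the', 'NN'), ('dogs', 'NNS'), ('walked', 'VBD'), ('home', 'NN')]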

Sanjiv

You can't write that kind of rule directly in NLTK right now, but you can use some tricks.

For example, transcribe your sentence into some kind of informative word labels (such as POS tags) and write your grammar rules over those labels.

For example (using POS tag as label):

Dogs eat bones. 

becomes:

NN V NN.

And grammar terminal rules example:

V -> 'V'

If that's not enough, you should look at an implementation of a more flexible grammar formalism.
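
A rough sketch of that trick, using an off-the-shelf English POS tagger purely for illustration (the coarse label mapping and the tiny grammar are assumptions you would adapt to your own rules), might look like this:

    import nltk

    def to_labels(tagged):
        """Collapse fine-grained Penn Treebank tags into the coarse labels the grammar uses."""
        coarse = []
        for word, tag in tagged:
            if tag.startswith('VB'):
                coarse.append('V')
            elif tag.startswith('NN'):
                coarse.append('NN')
            else:
                coarse.append(tag)
        return coarse

    # Terminal rules refer to labels, not to concrete words
    grammar = nltk.CFG.fromstring("""
      S  -> NP VP
      VP -> V NP
      NP -> 'NN'
      V  -> 'V'
    """)

    tagged = nltk.pos_tag(nltk.word_tokenize("Dogs eat bones"))
    labels = to_labels(tagged)            # e.g. ['NN', 'V', 'NN']
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(labels):
        print(tree)

The grammar never mentions individual words, so new vocabulary only requires that the labelling step (whatever rules you use for it) produces the right label.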

ermath

Another option is to use a regular expression parser instead. See https://www.nltk.org/book/ch07.html. Something like this:

    >>> import nltk, re, pprint
    >>> from nltk import word_tokenize, sent_tokenize
    >>> my_sentence = "This is just an example"
    >>> tokenized_sentence = word_tokenize(my_sentence)
    >>> tagged_sentence = nltk.pos_tag(tokenized_sentence)
    >>> grammar = """
    ...   P:   {<IN>}
    ...   N:   {<NN.*>}
    ...   DET: {<DT>}
    ...   NP:  {<DET><N><PP>?}
    ...        {<NNP>}
    ...   V:   {<VB.*>}
    ...   PP:  {<P><NP>}
    ...   VP:  {<V><NP>}
    ...        {<V><NP><PP>}
    ...   S:   {<NP><VP>}
    ... """
    >>> cp = nltk.RegexpParser(grammar)
    >>> tree = cp.parse(tagged_sentence)
    >>> print(tree)
    (S (DET This/DT) (V is/VBZ) just/RB (NP (DET an/DT) (N example/NN)))        

The downside is that if you are looking for specific hard-coded words, this won't point them out directly. However, you can walk the tree and pull out the words with something like the snippet below; the book at the link above describes this.

    for subtree in tree.subtrees():
        if subtree.label() == 'N':
            # each leaf is a (word, tag) pair, so [0][0] picks out the word itself
            noun = subtree[0][0]
            do_something(noun)  # placeholder for your own processing
David J