I want to create a very simple context-free grammar for the Greek language using NLTK. I am running Python 2.7 on Windows.

Here's my code:

# -*- coding: utf-8 -*-
import nltk
grammar = nltk.CFG.fromstring("""
            S -> Verb Noun
            Verb -> a
            Noun -> b
            """)
a="κάνω"
b="ποδήλατο"

user_input = "κάνω ποδήλατο"

How can I tell if the user_input is grammatically correct? I tried:

sent =  user_input.split()
parser = nltk.ChartParser(grammar)
for tree in parser.parse(sent):
        print tree

but I get the following error, which occurs in the grammar.py file (line 632), that comes with nltk:

ValueError: Grammar does not cover some of the input words: u"'\\xce\\xba\\xce\\xac\\xce\\xbd\\xcf\\x89', '\\xcf\\x80\\xce\\xbf\\xce\\xb4\\xce\\xae\\xce\\xbb\\xce\\xb1\\xcf\\x84\\xce\\xbf'".

I only get the error when I use the for loop. Until that point I receive no error. So I suppose it's some kind of encoding problem which I don't know how to overcome.

jonrsharpe
Kostis

1 Answer

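First, when you build a grammar with nltk.CFG.fromstring, the terminals (the actual words in the lexicon) must be written directly in the grammar as quoted strings. An unquoted symbol such as a or b is read as a non-terminal named a or b, not as the Python variable of that name: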

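# -*- coding: utf-8 -*-
import nltk

# The u prefix makes the grammar a unicode string, so the terminals are stored as text.
grammar = nltk.CFG.fromstring(u"""
            S -> Verb Noun
            Verb -> "κάνω"
            Noun -> "ποδήλατο"
            """)
parser = nltk.ChartParser(grammar)
print parser.grammar()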

[out]:

Grammar with 3 productions (start state = S)
    S -> Verb Noun
    Verb -> '\u03ba\u03ac\u03bd\u03c9'
    Noun -> '\u03c0\u03bf\u03b4\u03ae\u03bb\u03b1\u03c4\u03bf'
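
The question kept the words in Python variables (a and b). Since an unquoted name in the grammar text is not replaced by a variable's value, one option is to insert the words into the grammar text as quoted terminals before handing it to nltk.CFG.fromstring. This is only a minimal sketch: build_grammar_text is a hypothetical helper, not part of NLTK, and it assumes the words contain no double quotes.

```python
# -*- coding: utf-8 -*-

def build_grammar_text(verb, noun):
    """Return CFG text with the two words inserted as quoted terminals.

    Hypothetical helper for illustration; assumes the words contain no
    double-quote characters.
    """
    template = (
        u'S -> Verb Noun\n'
        u'Verb -> "{verb}"\n'
        u'Noun -> "{noun}"\n'
    )
    return template.format(verb=verb, noun=noun)


a = u"κάνω"
b = u"ποδήλατο"
grammar_text = build_grammar_text(a, b)
# grammar_text can then be passed to nltk.CFG.fromstring(grammar_text).
```

Keeping the words in variables this way still produces a grammar whose terminals are unicode strings, which is what the rest of this answer relies on.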

Now we look at your user_input:

>>> print ["κάνω ποδήλατο"]
['\xce\xba\xce\xac\xce\xbd\xcf\x89 \xcf\x80\xce\xbf\xce\xb4\xce\xae\xce\xbb\xce\xb1\xcf\x84\xce\xbf']

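Notice that in Python 2.x an unprefixed string literal is a byte string (here, the UTF-8 bytes of the text), whereas in Python 3.x it would be a Unicode string by default. The grammar holds unicode terminals, so these raw bytes do not match any of them. Now look at the same string after decoding it from UTF-8: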

>>> print ["κάνω ποδήλατο".decode('utf8')]
[u'\u03ba\u03ac\u03bd\u03c9 \u03c0\u03bf\u03b4\u03ae\u03bb\u03b1\u03c4\u03bf']

Note that when you hardcode a string, writing u"κάνω ποδήλατο" has the same effect as "κάνω ποδήλατο".decode('utf8'), provided the source file is saved as UTF-8 and declares that encoding.
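
The difference between the two forms can be checked directly. The sketch below is meant to run on both Python 2 and Python 3; to_text is a hypothetical helper name, and UTF-8 is assumed as the input encoding.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding it if it is a byte string."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


text = u"κάνω ποδήλατο"      # unicode string (what the grammar holds)
raw = text.encode("utf-8")    # byte string (what an unprefixed Python 2 literal holds)

# Decode first, then split, so that every token is a unicode string.
tokens = to_text(raw).split()
```

Passing user input through a helper like this before calling split() keeps the tokens comparable with the grammar's terminals, whichever form the string arrived in.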

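The decoded input now has the same form as the terminals that nltk.CFG.fromstring() read from the grammar, so the parser can match them. If parser.parse() yields at least one tree, the input is grammatical according to the grammar: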

# -*- coding: utf-8 -*-

import nltk
grammar = nltk.CFG.fromstring(u"""
            S -> Verb Noun
            Verb -> "κάνω"
            Noun -> "ποδήλατο"
            """)
parser = nltk.ChartParser(grammar)

# Use a unicode literal so the tokens match the unicode terminals in the grammar.
user_input = u"κάνω ποδήλατο"
sent = user_input.split()

# Each tree yielded is one valid parse of the sentence.
for tree in parser.parse(sent):
    print tree

[out]:

(S (Verb \u03ba\u03b1\u03bd\u03c9) (Noun \u03c0\u03bf\u03b4\u03b7\u03bb\u03b1\u03c4\u03bf))

You may notice something odd about the output: it shows the Unicode escape sequences of the words rather than the Greek characters themselves. A plain (byte) string containing those escapes is not the same as the unicode string:

>>> x = '\u03ba\u03b1\u03bd\u03c9'
>>> print x
\u03ba\u03b1\u03bd\u03c9
>>> print x.decode('utf8')
\u03ba\u03b1\u03bd\u03c9
>>> print x.encode('utf8')
\u03ba\u03b1\u03bd\u03c9
>>> x = u'\u03ba\u03b1\u03bd\u03c9'
>>> print x
κανω

To get the original unicode text back, decode the escaped string with the unicode_escape codec (thanks to @Kasra, see "How to retrieve my unicode from the unicode byte representation"):

>>> s='\u03ba\u03b1\u03bd\u03c9'
>>> print unicode(s,'unicode_escape')
κανω
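
The unicode() built-in exists only in Python 2. As a sketch that should run on both Python 2 and Python 3, the same conversion can be done by calling decode on a byte string. It assumes the escaped text is pure ASCII, as in the output above.

```python
escaped = b"\\u03ba\\u03b1\\u03bd\\u03c9"   # literal backslash-u sequences, as bytes

# The unicode_escape codec turns each \uXXXX sequence into its character.
restored = escaped.decode("unicode_escape")
```

Here restored is a unicode string of four Greek letters, the same value as u'\u03ba\u03b1\u03bd\u03c9' in the session above.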
alvas
  • It works up to the point where I have to retrieve the original unicode. I use Sublime Text 3 and get a UnicodeEncodeError there. It works if I write the code in IDLE, though. – Kostis Jan 02 '15 at 13:35
  • Maybe it's because of the default encoding. Take a look at this: http://stackoverflow.com/questions/27659861/unable-to-process-accented-words-using-nltk-tokeniser/27660196#27660196. I assume that the input data will be entered through stdin using raw_input. Suggestion: use python 3. Can you post the error traceback from Sublime? – alvas Jan 02 '15 at 13:53
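
If the UnicodeEncodeError does come from the output stream's default encoding, as the comment above suspects, one possible workaround is to encode the text explicitly as UTF-8 when writing. This is a minimal sketch and assumes the console can display UTF-8; utf8_writer is a hypothetical helper name, and io.BytesIO stands in for the real byte stream.

```python
# -*- coding: utf-8 -*-
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte stream so that unicode text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)


# Stand-in for sys.stdout (Python 2) or sys.stdout.buffer (Python 3).
fake_stdout = io.BytesIO()
out = utf8_writer(fake_stdout)
out.write(u"κανω\n")
written = fake_stdout.getvalue()   # the UTF-8 bytes that would reach the console
```

Whether this helps depends on what the editor's build console does with the bytes, so the traceback requested above is still the best way to confirm the cause.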