Firstly, you have to declare the terminals, i.e. the words in the lexicon, directly in the CFG grammar if you're using nltk.CFG.fromstring:
import nltk
grammar = nltk.CFG.fromstring(u"""
S -> Verb Noun
Verb -> "κάνω"
Noun -> "ποδήλατο"
""")
parser = nltk.ChartParser(grammar)
print parser.grammar()
[out]:
Grammar with 3 productions (start state = S)
S -> Verb Noun
Verb -> '\u03ba\u03ac\u03bd\u03c9'
Noun -> '\u03c0\u03bf\u03b4\u03ae\u03bb\u03b1\u03c4\u03bf'
Now let's look at your user_input:
>>> print ["κάνω ποδήλατο"]
['\xce\xba\xce\xac\xce\xbd\xcf\x89 \xcf\x80\xce\xbf\xce\xb4\xce\xae\xce\xbb\xce\xb1\xcf\x84\xce\xbf']
Notice that in Python 2.x the string literal is read as a byte string, whereas in Python 3.x it would have been a unicode string by default. Now look at what happens when we decode it from UTF-8:
>>> print ["κάνω ποδήλατο".decode('utf8')]
[u'\u03ba\u03ac\u03bd\u03c9 \u03c0\u03bf\u03b4\u03ae\u03bb\u03b1\u03c4\u03bf']
Note that u"κάνω ποδήλατο" has the same effect as "κάνω ποδήλατο".decode('utf8'): both give you a unicode string, so the u prefix is the simpler choice when you're hardcoding a value.
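To make the bytes-versus-text distinction concrete, here is a minimal sketch written for Python 3, where the two types never compare equal. The `terminals` set is a hypothetical stand-in for the words declared in the grammar, not something NLTK exposes under that name.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: bytes and text are distinct types.
text = "κάνω ποδήλατο"        # str: a sequence of code points
raw = text.encode("utf8")     # bytes: what Python 2 stores for a plain "..." literal

# Hypothetical stand-in for the terminals declared in the grammar.
terminals = {"κάνω", "ποδήλατο"}

byte_tokens = raw.split()                  # tokens that are still bytes
text_tokens = raw.decode("utf8").split()   # tokens decoded back to text

# Byte tokens never match text terminals; decoded tokens do.
bytes_match = all(tok in terminals for tok in byte_tokens)
text_match = all(tok in terminals for tok in text_tokens)
print(bytes_match, text_match)
```

The same mismatch is what you want to avoid in Python 2: the tokens you hand to the parser should be the same string type as the terminals in the grammar.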
Now the input looks the same as the grammar read by nltk.CFG.fromstring():
# -*- coding: utf-8 -*-
import nltk
grammar = nltk.CFG.fromstring(u"""
S -> Verb Noun
Verb -> "κάνω"
Noun -> "ποδήλατο"
""")
parser = nltk.ChartParser(grammar)
# Split the unicode input into tokens that match the grammar's terminals.
user_input = u"κάνω ποδήλατο".split()
for tree in parser.parse(user_input):
    print tree
[out]:
(S (Verb \u03ba\u03b1\u03bd\u03c9) (Noun \u03c0\u03bf\u03b4\u03b7\u03bb\u03b1\u03c4\u03bf))
You may notice something odd about the output, though: the words are printed not as readable unicode but as their unicode escape sequences:
>>> x = '\u03ba\u03b1\u03bd\u03c9'
>>> print x
\u03ba\u03b1\u03bd\u03c9
>>> print x.decode('utf8')
\u03ba\u03b1\u03bd\u03c9
>>> print x.encode('utf8')
\u03ba\u03b1\u03bd\u03c9
>>> x = u'\u03ba\u03b1\u03bd\u03c9'
>>> print x
κανω
To retrieve your original unicode, decode the string with the unicode_escape codec (thanks to @Kasra, see "How to retrieve my unicode from the unicode byte representation"):
>>> s='\u03ba\u03b1\u03bd\u03c9'
>>> print unicode(s,'unicode_escape')
κανω
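If you are on Python 3, the unicode builtin no longer exists, so the call above will not run as written. A rough equivalent, assuming the escaped text contains only ASCII characters, is to encode it to bytes and decode with the same unicode_escape codec:

```python
# Python 3 sketch: a literal backslash-u sequence versus the text it stands for.
escaped = "\\u03ba\\u03b1\\u03bd\\u03c9"   # 24 ASCII characters, as printed in the tree above

# Encode to bytes first, then let the unicode_escape codec interpret the escapes.
decoded = escaped.encode("ascii").decode("unicode_escape")

print(len(escaped), len(decoded))
print(decoded)
```

The encode("ascii") step will raise an error if the string already contains non-ASCII characters, which is a reasonable signal that it does not need unescaping.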