NLTK CFG Parser fails in parsing words in Portuguese

Question

I'm trying to use the NLTK CFG parser, but I get the error "Grammar does not cover some of the input words". The code I'm using is:

import nltk
import codecs
strProductions = ''
f = codecs.open('C://nltk_data//corpora//CINTIL_TreeBank//producoes_S.txt', 'r',
                encoding='latin-1')
for line in f:
    strProductions = strProductions + line
f.close()
grammar = nltk.grammar.CFG.fromstring(strProductions)
cp = nltk.ChartParser(grammar)
print grammar

S -> V PNT
V -> 'Choveu'    
NP -> DEM N
PP -> P NP
P -> 'de'
NP -> N_
N_ -> N A
N -> 'crian\\xe7a'

tokens = []
a = u'criança'
b = '.'
a = a.encode('latin-1')
tokens.append(a)
tokens.append(b)
for tree in cp.parse(tokens):
    print tree
C:\Anaconda2\lib\site-packages\nltk\grammar.pyc in check_coverage(self, tokens)
629             missing = ', '.join('%r' % (w,) for w in missing)
630             raise ValueError("Grammar does not cover some of the "
--> 631                              "input words: %r." % missing)
632 
633     def _calculate_grammar_forms(self):

ValueError: Grammar does not cover some of the input words:
u"'crian\\xe7a'".

Can someone help me identify what is happening?

Thanks in advance

What happens if you replace the special character "\\xe7" with a regular ASCII 'c'? (in both grammar and a) — Tomer Levinboim, Mar 06 '16 at 01:32
It works. The issue arises when I use extended characters of Portuguese language, such as ç, á, à, ã, etc. — padovani, Mar 06 '16 at 11:19
Take a look at http://stackoverflow.com/questions/27659861/unable-to-process-accented-words-using-nltk-tokeniser/27660196#27660196 — Tomer Levinboim, Mar 06 '16 at 11:26
Same problem. I think it is something related to the CFG class. It looks like it doesn't work with encoding. — padovani, Mar 06 '16 at 23:36
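
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the grammar file is read through codecs.open with encoding='latin-1', so the terminals end up as decoded text, while a.encode('latin-1') turns the token into a byte string. The two then no longer compare equal. The snippet below reproduces only that mismatch with plain Python and no NLTK; the helper `uncovered` is a hypothetical stand-in for a coverage check, not NLTK's own code, and the terminal set is copied from the grammar shown in the question.

```python
# Sketch only: `uncovered` is a hypothetical stand-in for a coverage check,
# not NLTK's implementation.

def uncovered(terminals, tokens):
    """Return the tokens that do not appear among the grammar terminals."""
    return [tok for tok in tokens if tok not in terminals]

# Terminals as they would look after reading the grammar file through
# codecs.open(..., encoding='latin-1'): decoded text strings.
terminals = {u'Choveu', u'de', u'crian\xe7a'}

text_token = u'crian\xe7a'                 # decoded text, same type as the terminals
byte_token = text_token.encode('latin-1')  # what a.encode('latin-1') produces

missing_bytes = uncovered(terminals, [byte_token])
missing_text = uncovered(terminals, [text_token])

print(missing_bytes == [byte_token])  # the encoded token is reported as missing
print(missing_text == [])             # the text token is found
```

If this is what is happening, the first thing to try would be dropping the a.encode('latin-1') line so the token stays a unicode string like the terminals. It would also be consistent with the ASCII 'c' test above: in Python 2 an ASCII-only byte string compares equal to the corresponding unicode string, so the mismatch only shows up with characters such as ç.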
