This is related to the following questions:
- Python unicode equal comparison failed
- Find word infront and behind of a Python list
- Searching for Unicode characters in Python
- NLTK Context Free Grammar Genaration
I have a Python app that performs the following tasks:
# -*- coding: utf-8 -*-
1. Read a Unicode text file (non-English):
import codecs

def readfile(file, access, encoding):
    # codecs.open decodes the file while reading, so a unicode string is returned
    with codecs.open(file, access, encoding) as f:
        return f.read()

text = readfile('teststory.txt', 'r', 'utf-8-sig')
This returns the contents of the given text file as a string.
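For comparison, here is a minimal sketch of the same read step using `io.open`, which behaves the same on Python 2 and 3. The helper name `read_text`, the temporary file, and the sample word are assumptions made only for illustration; they are not part of my app.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def read_text(path, encoding='utf-8-sig'):
    # io.open decodes while reading, so the result is a unicode string;
    # the 'utf-8-sig' codec drops a leading byte order mark if present.
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()

# Hypothetical sample: a BOM followed by a non-ASCII word and an ASCII word.
sample = u'\ufeff\u0915\u093e\u0930 drives'
fd, tmp_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(tmp_path, 'w', encoding='utf-8') as f:
    f.write(sample)

text = read_text(tmp_path)
os.remove(tmp_path)
```

Here `text` is a unicode string with the byte order mark removed, so later steps never see raw bytes.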
2. Split text into sentences.
3. Go through words in each sentence and identify verbs, nouns etc.
See: Searching for Unicode characters in Python and Find word infront and behind of a Python list
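Steps 2 and 3 can be sketched roughly as below. This is a simplified illustration, not my actual code: the word sets `KNOWN_NOUNS` and `KNOWN_VERBS`, the helper names, and the English sample text are all hypothetical, and the real text is non-English.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical lookup sets; the real app identifies words differently.
KNOWN_NOUNS = {u'car', u'bus'}
KNOWN_VERBS = {u'drives', u'hits'}

def split_sentences(text):
    # Split after '.', '!' or '?' when followed by whitespace; drop empty pieces.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def classify(sentence):
    # Collect the words of one sentence that appear in the lookup sets.
    nouns, verbs = [], []
    for word in re.findall(r'\w+', sentence, re.UNICODE):
        lowered = word.lower()
        if lowered in KNOWN_NOUNS:
            nouns.append(word)
        elif lowered in KNOWN_VERBS:
            verbs.append(word)
    return nouns, verbs

sentences = split_sentences(u'The car hits the bus. The bus drives away.')
tagged = [classify(s) for s in sentences]
```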
4. Add them to separate variables as below:
nouns = '"CAR" | "BUS"'
verbs = '"DRIVES" | "HITS"'
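A minimal sketch of how such strings could be built from word lists, keeping everything as unicode and avoiding a trailing `|`. The helper `to_alternatives` and the list names are hypothetical names used only for this illustration.

```python
# -*- coding: utf-8 -*-

def to_alternatives(words):
    # Quote each terminal and join with ' | ', leaving no trailing separator.
    return u' | '.join(u'"%s"' % w for w in words)

noun_words = [u'CAR', u'BUS']
verb_words = [u'DRIVES', u'HITS']

nouns = to_alternatives(noun_words)
verbs = to_alternatives(verb_words)
```

Because the pieces and the separator are all unicode, the joined result is unicode too, which matters once the words are non-ASCII.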
5. Now I'm trying to pass them into an NLTK context-free grammar as below:
grammar = nltk.parse_cfg('''
S -> NP VP
NP -> N
VP -> V | NP V
N -> '''+nouns+'''
V -> '''+verbs+'''
''')
It gives me the following error:
line 40, in V -> '''+verbs+'''
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 114: ordinal not in range(128)
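The message suggests that a UTF-8 byte string is being decoded with the ASCII codec somewhere in the concatenation. In Python 2, joining a unicode string with a byte string triggers exactly such an implicit ASCII decode. The sketch below reproduces that class of error in isolation; it is an assumption about the cause, not something confirmed against the full code, and the sample character is arbitrary (chosen only because its UTF-8 encoding starts with byte 0xe0).

```python
# -*- coding: utf-8 -*-

word = u'\u0915'              # arbitrary non-ASCII sample character
raw = word.encode('utf-8')    # byte string whose first byte is 0xe0

# Decoding those bytes as ASCII fails the same way as the reported error.
try:
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding explicitly as UTF-8 first keeps the whole rule in unicode.
decoded = raw.decode('utf-8')
rule = u'N -> "%s"' % decoded
```

If this is what happens in my code, then every piece concatenated into the grammar string would need to be unicode before it reaches NLTK.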
How can I overcome this error and pass variables into an NLTK CFG?
Complete Code - https://dl.dropboxusercontent.com/u/4959382/new.zip