
This is related to the following questions:

I have a Python app doing the following tasks:

# -*- coding: utf-8 -*-

1. Read unicode text file (non-english) -

import codecs

def readfile(file, access, encoding):
    with codecs.open(file, access, encoding) as f:
        return f.read()

text = readfile('teststory.txt','r','utf-8-sig')

This returns the given text file's contents as a unicode string.

2. Split text into sentences.
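A minimal sketch of this step using a plain regular expression (NLTK's `sent_tokenize` would be the more robust choice; the punctuation pattern here is an assumption about the text):

```python
# -*- coding: utf-8 -*-
import re

def split_sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace;
    # real text (abbreviations, quotes) needs a proper tokenizer instead.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sentences = split_sentences(u"The car drives. The bus hits!")
```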

3. Go through words in each sentence and identify verbs, nouns etc.

Refer - Searching for Unicode characters in Python and Find word infront and behind of a Python list
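The linked questions cover the real tagging; as a self-contained sketch, step 3 can be approximated by a lookup against hand-made word lists (the lists and the `classify` helper here are hypothetical):

```python
# -*- coding: utf-8 -*-
# Hypothetical word lists; in the real app these come from tagging the text.
NOUN_WORDS = {u"CAR", u"BUS"}
VERB_WORDS = {u"DRIVES", u"HITS"}

def classify(word):
    # Coarse part-of-speech lookup against the hand-made lists above.
    w = word.upper()
    if w in NOUN_WORDS:
        return "N"
    if w in VERB_WORDS:
        return "V"
    return "?"

tags = [classify(w) for w in u"car hits".split()]
```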

4. Add them into separate variables as below

nouns = u'"CAR" | "BUS"'

verbs = u'"DRIVES" | "HITS"'
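Such alternation strings can be built from word lists with `join`; each word must be quoted so the CFG parser treats it as a terminal (the word lists here are placeholders):

```python
# -*- coding: utf-8 -*-
def to_alternatives(words):
    # Quote each word and join with '|', the CFG alternation separator.
    return u" | ".join(u'"%s"' % w for w in words)

nouns = to_alternatives([u"CAR", u"BUS"])
verbs = to_alternatives([u"DRIVES", u"HITS"])
```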

5. Now I'm trying to pass them into an NLTK context-free grammar as below:

grammar = nltk.parse_cfg('''
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> '''+nouns+'''
    V -> '''+verbs+'''
    ''')

It gives me the following error:

line 40, in V -> '''+verbs+''' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 114: ordinal not in range(128)

How can I overcome this and pass the variables into the NLTK CFG?

Complete Code - https://dl.dropboxusercontent.com/u/4959382/new.zip

ChamingaD
  • Can you show the *full* traceback of the error? – Bakuriu Aug 18 '13 at 07:38
  • I'm using Pycharm. How can i print full traceback ? print_stack() didn't work. Anyway can figure out issue with given exception ? – ChamingaD Aug 19 '13 at 05:20
  • `import logging; try: your-code; except: logging.exception("ouch")` # for clarity, use newlines and indentation instead of `;` – Dima Tisnek Aug 19 '13 at 09:49
  • Please also paste the proper code that defines `nouns` and `verbs`. See, `"CAR" | "BUS"` (literally) is not possible in Python; I guess it's some string passed to the parser? – Dima Tisnek Aug 19 '13 at 09:52
  • @qarma I will attach complete code for your reference. nouns and verbs are variables which holds some unicode text in format of "CAR" | "BUS" – ChamingaD Aug 19 '13 at 10:21

1 Answer


Overall you have these strategies:

  • treat input as a sequence of bytes; then both input and grammar are utf-8-encoded data (bytes)
  • treat input as a sequence of unicode code points; then both input and grammar are unicode
  • escape unicode code points to ascii, that is, use escape sequences

The nltk installed with pip (2.0.4 in my case) doesn't accept unicode directly, but it does accept quoted unicode constants; that is, all of the following appear to work:

In [26]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar')
Out[26]: <Grammar with 2 productions>

In [27]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("utf-8"))
Out[27]: <Grammar with 2 productions>

In [28]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("unicode_escape"))
Out[28]: <Grammar with 2 productions>

Note that I quoted the unicode text ("€") but not the plain text (bar).
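Applied to the question's code, the third strategy amounts to building the whole grammar as one unicode string and escaping it before the call (the word lists below are placeholders for the real variables):

```python
# -*- coding: utf-8 -*-
nouns = u'"CAR" | "BUS"'       # placeholder for the real unicode data
verbs = u'"DRIVES" | "HITS"'

# Keep every piece unicode, so '+' never triggers an implicit ascii decode.
grammar_src = (u"S -> NP VP\n"
               u"NP -> N\n"
               u"VP -> V | NP V\n"
               u"N -> " + nouns + u"\n"
               u"V -> " + verbs + u"\n")

# Escape any non-ascii code points; the result is pure-ascii data
# that this nltk version accepts.
escaped = grammar_src.encode("unicode_escape")
# grammar = nltk.parse_cfg(escaped)
```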

Dima Tisnek
  • Hmm. how to apply above encoding to my code ? grammar = nltk.parse_cfg(''' S -> NP VP NP -> N | D N | ADJ N | ADJ N P | D N P | D ADJ N P | ADJ N N N N N DET VP -> V | NP V | ADV V N -> '''+nouns+pronouns+''' D -> '''+determiners+''' ADJ -> '''+adjectives+''' ADV -> '''+adverbs+''' P -> '''+prepositions+''' V -> '''+verbs+''' ''') – ChamingaD Aug 19 '13 at 14:36