How to solve the UnicodeDecodeError when using stanford parser API in NLTK for python?

Question

I want to use the Stanford Parser from Python. I am on Windows 7 with Python 2.7 and NLTK 3.0 installed, and I downloaded the Stanford Parser from the official site.

I first ran into a problem with the JAVAHOME environment variable, which I solved. Then I got this error message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

I can't find a solution for this problem.

I used the following code:

# -*- coding: utf-8 -*-

from nltk.parse import stanford

parser = stanford.StanfordParser(model_path='C:\Program Files (x86)\stanford-parser-full-2015-01-30\edu\stanford\nlp\models\lexparser\englishPCFG.ser.gz')

sent = 'my name is zim'
parser.parse(sent)

I searched Stack Overflow for a solution but did not find one.

See this: http://stackoverflow.com/questions/28365626/how-to-output-nltk-chunks-to-file/28381060#28381060 – alvas, Apr 11 '15 at 21:19
Guys, I'm so grateful for your help. I did what you suggested, @alvas (it was hard to do what the others suggested, but thank you all for your time): I installed Python 3.3.3 and NLTK 3.0.2. Now I'm getting this error: "raise OSError('Java command failed : ' + str(cmd)) OSError: Java command failed :...". I have no idea what causes it. Please help me get the Stanford Parser working; I really need it for my project. – ziMtyth, Apr 12 '15 at 00:12
Yes, and I'm using jdk1.8.0_20 and jre1.8.0_20. I tried adding their paths to the Path environment variable, but it still doesn't work. Note that I also use another variable (JAVAHOME), which contains the path of the JDK and not the JRE. My JDK and JRE are installed in C:\Program Files\Java. Any suggestions, @alvas? Sorry for taking up your time. – ziMtyth, Apr 12 '15 at 16:28
A comprehensive answer came in on the same day, but you do not appear to have voted on it, accepted it or replied to it. May I ask why? – halfer, Jan 23 '16 at 11:21

score 2 · Answer 1 · edited May 23 '17 at 12:07

If the os.environ or export paths are set properly, as described in the question "Stanford Parser and NLTK", then the problem should come down to two things:

specifying the encoding in the NLTK API, and
the encoding of your input string

So the solution would be:

update NLTK to the latest stable version, i.e. sudo pip install -U nltk (on Windows, run pip install -U nltk without sudo)
use Python 3, or specify the encoding for your string

If you are unable to update your Python or NLTK, then:

specify the encoding when using Stanford API in NLTK (because of https://github.com/nltk/nltk/issues/877)
specify the encoding for your string (see How to output NLTK chunks to file?)
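As a rough illustration of "specify the encoding for your string", the sketch below decodes byte strings explicitly before they are handed to any parser. The helper name to_text is hypothetical (it is not part of NLTK), and UTF-8 is only an assumed default; use whatever encoding your input really has.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a text (unicode) string, decoding bytes if needed.

    Hypothetical helper for illustration; not an NLTK function.
    """
    if isinstance(value, bytes):
        # On Python 2, a plain 'str' is a byte string and is decoded here.
        return value.decode(encoding)
    return value


# A byte string becomes text; a text string is passed through unchanged.
sentence = to_text(b'my name is zim')
accented = to_text(b'caf\xc3\xa9')  # UTF-8 bytes for "cafe" with an acute accent
print(repr(sentence))
print(repr(accented))
```

Decoding once, at the point where the data enters the program, avoids the implicit ASCII decoding that Python 2 performs when byte strings and unicode strings are mixed.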

It is STRONGLY recommended that you use Python 3, especially when handling text inputs.

If all else fails, and you only have the old version of NLTK and must use Python 2.7, then:

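import six
from nltk.parse import stanford

# Raw string (r"...") so that "\n" in "\nlp" is not read as a newline escape.
path_to_model = r"C:\Program Files (x86)\stanford-parser-full-2015-01-30\edu\stanford\nlp\models\lexparser\englishPCFG.ser.gz"

# Pass the encoding explicitly; NLTK 3.0 defaults to ASCII here.
parser = stanford.StanfordParser(model_path=path_to_model, encoding='utf8')

# Make sure the input is a unicode string on Python 2.
sent = six.text_type('my name is zim')
parser.parse(sent)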

See six docs @ http://pythonhosted.org//six/#six.text_type
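The answer assumes the environment variables are already set. A minimal sketch of doing that from Python is shown here. The variable names JAVAHOME, STANFORD_PARSER and STANFORD_MODELS are, as far as I know, the ones NLTK's Stanford wrapper looks up, but treat that as an assumption and check it against your NLTK version. Both directory paths are hypothetical placeholders.

```python
import os

# Hypothetical install locations; replace them with your own.
# Raw strings keep the backslashes literal.
JAVA_HOME_DIR = r'C:\Program Files\Java\jdk1.8.0_20'
PARSER_DIR = r'C:\stanford-parser-full-2015-01-30'

# Assumed variable names (verify them for your NLTK version).
os.environ['JAVAHOME'] = JAVA_HOME_DIR
os.environ['STANFORD_PARSER'] = PARSER_DIR
os.environ['STANFORD_MODELS'] = PARSER_DIR

for name in ('JAVAHOME', 'STANFORD_PARSER', 'STANFORD_MODELS'):
    print(name, '=', os.environ[name])
```

Values set through os.environ only affect the current Python process and the Java process it launches, so they need to be set before the parser object is created.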

score 1 · Answer 2 · answered Apr 11 '15 at 20:39

0xe9 isn't a valid ASCII byte, so your englishPCFG.ser.gz must not be ASCII encoded. You'll need to figure out what encoding it's using (probably UTF-8) and tell StanfordParser() about it with the encoding keyword argument.

Erin Call

That's only part of the problem. The default encoding for NLTK 3.0's Stanford API is 'ascii'; it has been changed to 'utf8' in the latest version, see https://github.com/nltk/nltk/issues/877. The other part is how the OP read the string. Using Python 3 and the latest stable version of NLTK resolves the issue. – alvas Apr 11 '15 at 21:16
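The error message from the question can be reproduced in isolation, which may help to see what an ASCII default implies. The sketch below (Python 3) decodes the single byte 0xe9 with the ASCII codec and then with Latin-1. Latin-1 is used purely for illustration, since it maps 0xe9 to an accented "e"; it is not a claim about which encoding the parser output actually uses.

```python
raw_bytes = b'\xe9'  # the byte named in the error message

message = None
try:
    # ASCII only covers bytes 0-127, so 0xe9 (233) cannot be decoded.
    raw_bytes.decode('ascii')
except UnicodeDecodeError as err:
    message = str(err)
    print(message)

# An encoding that defines the byte decodes it without error.
text = raw_bytes.decode('latin-1')
print(repr(text))

# The same character encoded as UTF-8 takes two bytes, neither of them ASCII.
print(text.encode('utf-8'))
```

Any code path that decodes such data with an ASCII default will fail in this way, which is why passing an explicit encoding (or using a version where the default is 'utf8') matters.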

ziMtyth · Accepted Answer · 2017-09-12T07:40:26.107

I found the problem that caused the error I encountered:

raise OSError('Java command failed : ' + str(cmd)) OSError: Java command failed :...

This error is due to the path being misinterpreted in the following instruction:

parser = stanford.StanfordParser(model_path='C:\Program Files (x86)\stanford-parser-full-2015-01-30\edu\stanford\nlp\models\lexparser\englishPCFG.ser.gz')

In an ordinary (non-raw) Python string literal, \n is the newline escape, so ...\nlp\... was read as a newline followed by lp\..., and the resulting path could not be found.

I tried a simple solution: I renamed the nlp folder. And it worked!
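Renaming the folder works, but the escape problem can also be avoided in the code itself. The sketch below uses a shortened, purely illustrative path fragment to show how a plain literal, a raw string and a doubled backslash differ. The forward-slash variant is included on the assumption that the consuming program accepts it, which is usually the case on Windows but is not guaranteed.

```python
plain = 'stanford\nlp'     # "\n" is parsed as a newline escape
raw = r'stanford\nlp'      # raw string: backslash and "n" are kept literally
escaped = 'stanford\\nlp'  # doubled backslash yields the same text as raw
forward = 'stanford/nlp'   # forward slashes need no escaping at all

print(repr(plain))
print(repr(raw))
print('newline in plain:', '\n' in plain)
print('newline in raw:', '\n' in raw)
print('raw equals escaped:', raw == escaped)
```

With a raw string (r'...') for model_path, the original folder name nlp can be kept.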
