Decoding error in paths using nltk.corpus.gutenberg.fileids()

Question

When I run nltk.corpus.gutenberg.fileids() with Python 2.7 (Anaconda, Windows) I get the following error:

File "C:\Anaconda\lib\ntpath.py", line 85, in join
    result_path = result_path + '\\'

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 9: ordinal not in range(128)

I don't get this error with Python 3.4. Maybe I'm wrong, but I suspect the path contains an accented character (there is an accent in my Windows username).
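The suspicion can be checked in isolation. Below is a minimal sketch, assuming a hypothetical user directory `C:\Users\élodie` stored as cp1252 bytes (the real path is not shown in the question). Python 2 implicitly decodes byte strings as ASCII when they are mixed with unicode strings, and the explicit `decode("ascii")` here stands in for that implicit step:

```python
# -*- coding: utf-8 -*-
# Hypothetical path (an assumption, not the asker's real one), encoded as
# cp1252 bytes; in that encoding the accented e is the single byte 0xe9.
raw_path = u"C:\\Users\\\u00e9lodie".encode("cp1252")

try:
    # Explicit stand-in for the implicit ASCII decode Python 2 performs
    # when a byte string is combined with a unicode string.
    raw_path.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
    # 'ascii' codec can't decode byte 0xe9 in position 9: ordinal not in range(128)
```

With this made-up path, position 9 is the first character after `C:\Users\`, which is at least consistent with an accented character in the username.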

When I add `print` statements to ntpath.py, nothing is printed, and I don't know why, so I'm unable to debug this myself.

EDIT: `import nltk` alone is enough to trigger the error.

add this line to the top of your script: `#!/usr/bin/env python -*- coding: utf-8 -*-`, e.g. https://github.com/alvations/pywsd/blob/master/pywsd/lesk.py — alvas, Jul 17 '15 at 19:48
set locale to utf8? http://docs.oracle.com/cd/E23824_01/html/E26033/glmha.html — alvas, Jul 17 '15 at 21:17
Run `import sys; reload(sys); sys.setdefaultencoding("utf-8")` before any string processing. Or don't touch the name and use `os.walk()`. — dsgdfg, Jul 17 '15 at 21:24
@SDilmac Your first solution does not change anything. About `os.walk()` I will study the doc to understand, thank you — clemtoy, Jul 17 '15 at 21:41
Or try this: http://stackoverflow.com/questions/5974585/python-not-able-to-open-file-with-non-english-characters-in-path I hope it helps. — dsgdfg, Jul 17 '15 at 21:50

score 1 · Accepted Answer · answered Jul 21 '15 at 19:37

My guess is that nltk on Python 2 has trouble with non-ASCII paths. Switching to Python 3 is probably the simplest fix, assuming you don't have much code that only runs on Python 2. It's hard to say for sure without the full traceback, but nltk would likely have to be patched to fix this on Python 2. Otherwise, you would need to keep non-ASCII characters out of the paths involved, which means either avoiding your user directory or changing your username.
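To see whether a given path is affected, something like the following may help. It is only a sketch: `non_ascii_chars` is a hypothetical helper name, and checking the home directory is an assumption about where the offending path comes from, since the full traceback is missing.

```python
import os


def non_ascii_chars(path):
    """Return (index, character) pairs in `path` that fall outside ASCII."""
    return [(i, ch) for i, ch in enumerate(path) if ord(ch) > 127]


# Assumption: the user's home directory is the path that causes trouble.
home = os.path.expanduser("~")
print(home, non_ascii_chars(home))
```

An empty list means the path is plain ASCII; any reported index can be compared with the position given in the error message.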
