
I'm currently writing a program which uses the Python NLTK library to determine whether a review is positive or negative. When trying to tokenize and store each word in an array, I keep getting the above error. The lines of code leading up to and including the error are:

from nltk.tokenize import word_tokenize

...

short_pos = open("reviews/pos_reviews.txt", "r").read()
short_neg = open("reviews/neg_reviews.txt", "r").read()

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, "pos") )

for r in short_neg.split('\n'):
    documents.append( (r, "neg") )

all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

The second-to-last line is where the error is reported. If I comment out that line, the error appears on the following line instead. I'm not sure where this error would come from, as I didn't think I was working with unicode at all. Any help would be appreciated!

1 Answer


In Python 2.7, try using the io module to specify the file encoding; see Difference between io.open vs open in python
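
For illustration, a minimal comparison (the file path and variable names here are just placeholders mirroring the question, and I'm assuming the review files are UTF-8, as in the snippet below):

import io

# The built-in open() in Python 2 returns byte strings (str); feeding
# non-ASCII bytes to word_tokenize is a common way to hit a decode error.
raw_bytes = open("reviews/pos_reviews.txt", "r").read()

# io.open() with an explicit encoding decodes the file for you and
# returns unicode text, which word_tokenize handles cleanly.
text = io.open("reviews/pos_reviews.txt", "r", encoding="utf8").read()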

Also, a context manager is your friend (i.e. with ... as ...), especially when it comes to I/O: https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/

import io

from nltk.tokenize import word_tokenize

documents = []

with io.open("reviews/pos_reviews.txt", "r", encoding="utf8") as fin:
    for line in fin:
        documents.append((line.strip(), "pos"))