UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 118374: ordinal not in range(128)

Question

I am experimenting with some NLP algorithms and am now focusing on sentiment analysis. To that end, I downloaded some .review files containing positive and negative reviews from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html.

I am using BeautifulSoup to parse these XML files. For now I am only trying to read them with the following code:

from bs4 import BeautifulSoup

positive_reviews = BeautifulSoup(open('*******/electronics/positive.review').read())
positive_reviews = positive_reviews.findAll('review_text')

negative_reviews = BeautifulSoup(open('*******/electronics/negative.review').read())
negative_reviews = negative_reviews.findAll('review_text')

However, I am getting the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 118374: ordinal not in range(128)

when the following line is executed:

positive_reviews = BeautifulSoup(open('*******/electronics/positive.review').read())

How can I fix this error?

I have also replaced

BeautifulSoup(open('*******/electronics/positive.review').read())

with

BeautifulSoup(open('*******/electronics/positive.review').read().decode('utf-8'))

but I am getting exactly the same error.

Finally, I have already read several related posts on StackOverflow, but nothing has worked so far. For example, in my terminal echo $LANG outputs en_GB.UTF-8, as described in the first answer to "UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1", but I still get the error above.

Did you try `BeautifulSoup(open('*******/electronics/negative.review').read().decode('utf-8'))`? — Keyur Potdar, Jun 12 '18 at 08:39
Thank you for your comment. Yes, I have tried it. I am going to add this to my post. — Outcast, Jun 12 '18 at 08:41
Your `IDE/OS` is ASCII but you need a UTF-8 **OUTPUT**. Go to the system tab (on the OS) and set the default encoding to `UTF-8` (before `x.decode("utf-8")`). I/O (input/output) always uses the system encoding (actually no encoding, only ASCII, meaning `unsigned char *`). For example: `s = u"test"`, `v = "test"` and `s == v` is `True`, but `Size s` = 68 and `Size v` = 41, so **bye bye byte position!** — dsgdfg, Jun 12 '18 at 09:31
Thank you for your comment. However, your comment is a bit too dense. For example, how more specifically shall I do this "Go to system tab (on OS) and set default encoding to UTF-8..."? — Outcast, Jun 12 '18 at 09:49
Try adding `# -*- coding: utf-8 -*-` at the beginning of your program. — Siva, Jun 12 '18 at 09:53

score 1 · Accepted Answer · answered Jun 12 '18 at 13:08

If you're using Python 3, try replacing

open('*******/electronics/positive.review')

with

open('*******/electronics/positive.review', encoding='utf-8')
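
As a rough illustration of why this helps: in Python 3, open() in text mode without an encoding argument decodes with the locale's preferred encoding, which can be ASCII in some IDE or shell setups even when the file is UTF-8. The sketch below is not the asker's data; it writes a small hypothetical sample file containing a 0xef byte, shows that decoding it as ASCII fails, and then reads it with an explicit encoding='utf-8'. The sample text and temporary file are assumptions made for the demonstration.

```python
import os
import tempfile

# Hypothetical sample data: U+FF01 encodes to the UTF-8 bytes EF BC 81,
# so the file contains a 0xef byte like the one named in the traceback.
sample = "<review_text>Great product\uff01</review_text>"

with tempfile.NamedTemporaryFile(delete=False, suffix=".review") as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

ascii_failed = False
failing_byte = None

try:
    # Forcing ASCII reproduces the failure mode from the question.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
    except UnicodeDecodeError as exc:
        ascii_failed = True
        failing_byte = exc.object[exc.start]  # the byte that could not be decoded

    # Naming the encoding explicitly makes the read independent of the locale.
    with open(path, encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

print(ascii_failed, failing_byte, text == sample)
```

The string returned by the UTF-8 read can then be passed to BeautifulSoup in the same way as in the question.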

kristaps


Thanks. I had missed that. :) – Outcast Jun 12 '18 at 14:03
