I am experimenting with some NLP algorithms and am currently focusing on sentiment analysis. To that end, I downloaded some positive and negative reviews in .review format from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html.

I am using BeautifulSoup to parse these XML files, and for now I am only trying to read them with the following code:

from bs4 import BeautifulSoup

positive_reviews = BeautifulSoup(open('*******/electronics/positive.review').read())
positive_reviews = positive_reviews.findAll('review_text')

negative_reviews = BeautifulSoup(open('*******/electronics/negative.review').read())
negative_reviews = negative_reviews.findAll('review_text')

However, I am getting the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 118374: ordinal not in range(128)

when this line is executed:

positive_reviews = BeautifulSoup(open('*******/electronics/positive.review').read())

How can I fix this error?

I have also replaced

BeautifulSoup(open('*******/electronics/positive.review').read())

with

BeautifulSoup(open('*******/electronics/positive.review').read().decode('utf-8'))

but I am getting exactly the same error.

Finally, I have already read some related posts on Stack Overflow, but so far nothing has worked for me. For example, in my terminal, echo $LANG outputs en_GB.UTF-8, as described in the first answer to "UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1", but I am still getting the error above.

Outcast
  • Did you try `BeautifulSoup(open('*******/electronics/negative.review').read().decode('utf-8'))`? – Keyur Potdar Jun 12 '18 at 08:39
  • Thank you for your comment. Yes, I have tried it. I am going to add this to my post. – Outcast Jun 12 '18 at 08:41
  • Your `IDE/OS` is ascii but you need a UTF-8 **OUTPUT**. Go to the system tab (on the OS) and set the default encoding to `UTF-8` (before `x.decode("utf-8")`). I/O (Input/Output) always uses the system encoding (actually no encoding, only ascii (meaning `unsigned char *`)). Like this: `s = u"test"`, `v = "test"` and `s == v` is `True`, but `Size s` = 68 and `Size v` = 41, so **bye bye byte position!** – dsgdfg Jun 12 '18 at 09:31
  • Thank you for your comment. However, your comment is a bit too dense. For example, how more specifically shall I do this "Go to system tab (on OS) and set default encoding to UTF-8..."? – Outcast Jun 12 '18 at 09:49
  • try adding `# -*- coding: utf-8 -*-` at beginning of your program – Siva Jun 12 '18 at 09:53
  • Thank you for your comment. No this did not work either. – Outcast Jun 12 '18 at 10:45

1 Answer

If you're using Python 3, try replacing

open('*******/electronics/positive.review')

with

open('*******/electronics/positive.review', encoding='utf-8')
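
As a minimal sketch of why this should help, assuming Python 3 and a UTF-8 encoded file: the example below writes a small stand-in file (the sample text, the temporary file, and the helper name `read_review_file` are all hypothetical, not part of the dataset), shows that decoding it as ASCII raises the same kind of error, and then reads it with an explicit encoding.

```python
import os
import tempfile


def read_review_file(path):
    """Return the file's contents as text, decoding explicitly as UTF-8."""
    # Passing encoding= stops Python 3 from falling back to the locale's
    # default codec, which appears to be ASCII where the error occurs.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Hypothetical stand-in for positive.review. U+FF21 encodes to UTF-8 bytes
# that begin with 0xef, the byte named in the error message.
sample = "<review_text>Great \uff21 product</review_text>"

with tempfile.NamedTemporaryFile("wb", suffix=".review", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    sample_path = tmp.name

try:
    # Decoding the same bytes as ASCII reproduces a UnicodeDecodeError.
    try:
        with open(sample_path, encoding="ascii") as handle:
            handle.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Decoding as UTF-8 succeeds and round-trips the original text.
    text = read_review_file(sample_path)
finally:
    os.remove(sample_path)
```

The string returned by `read_review_file` can then be passed to `BeautifulSoup(...)` in place of `open(...).read()`. If the real files turn out not to be valid UTF-8, this will still raise a `UnicodeDecodeError`, and a different `encoding` value would be needed.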
kristaps