
I'm currently writing a program which uses the Python NLTK library to determine whether a review is positive or negative. When trying to tokenize and store each word in an array, I keep getting the above error. The lines of code leading up to and including the error are:

from nltk.tokenize import word_tokenize

...

short_pos = open("reviews/pos_reviews.txt", "r").read()
short_neg = open("reviews/neg_reviews.txt", "r").read()

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, "pos") )

for r in short_neg.split('\n'):
    documents.append( (r, "neg") )

all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

The second-to-last line is where the error is reported. If I comment out that line, the error appears on the following line instead. I'm not sure where this error would come from, as I didn't think I was working with unicode at all. Any help would be appreciated!

1 Answer


In Python 2.7, try using the io module to specify the file encoding; see Difference between io.open vs open in python
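
For illustration, a minimal comparison (the file path and variable names here are just placeholders mirroring the question, and I'm assuming the review files are UTF-8, as in the snippet below):

import io

# The built-in open() in Python 2 returns byte strings (str); feeding
# non-ASCII bytes to word_tokenize is a common way to hit a decode error.
raw_bytes = open("reviews/pos_reviews.txt", "r").read()

# io.open() with an explicit encoding decodes the file for you and
# returns unicode text, which word_tokenize handles cleanly.
text = io.open("reviews/pos_reviews.txt", "r", encoding="utf8").read()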

Also, a context manager is your friend (i.e. with ... as ...), especially when it comes to I/O: https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/

import io

from nltk.tokenize import word_tokenize

documents = []

with io.open("reviews/pos_reviews.txt", "r", encoding="utf8") as fin:
    for line in fin:
        documents.append((line.strip(), "pos"))