
I'm trying to build an RNN model that classifies a review as having positive or negative sentiment.

I have a vocabulary dictionary, and during preprocessing I convert each review into a sequence of indices.
For example:

"This movie was best" --> [2,5,10,3]

When I build the list of the most frequent words and try to print it, I get this error:

num of reviews 100
number of unique tokens : 4761
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    print(vocab)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 10561: ordinal not in range(128)

The code is below:

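import itertools
import os

import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

# vocab_size and unknown_token are assumed to be defined elsewhere in the script.
reviews = []
for item in os.listdir('imdbdata/train/pos')[:100]:
    with open("imdbdata/train/pos/"+item,'r',encoding='utf-8') as f:
        # Name the parser explicitly to avoid BeautifulSoup's "no parser" warning.
        sample = BeautifulSoup(f.read(), 'html.parser').get_text()
    # Lowercase and tokenize the review text.
    sample = word_tokenize(sample.lower())
    reviews.append(sample)
print("num of reviews", len(reviews))
# Count token frequencies across all reviews.
word_freq = nltk.FreqDist(itertools.chain(*reviews))
print("number of unique tokens : %d"%(len(word_freq.items())))
# Keep the most frequent tokens, leaving one slot for the unknown token.
vocab = word_freq.most_common(vocab_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict((w,i) for i,w in enumerate(index_to_word))
print(vocab)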

The question is: how can I avoid this UnicodeEncodeError when working on NLP problems in Python, especially when the text is read with the open function?

Peter Kim

1 Answer


It looks like your terminal is configured for ASCII. Because the character '\xe9' is outside the range of ASCII characters (0x00-0x7F), it cannot be printed on an ASCII terminal. It also cannot be encoded as ASCII:

>>> s = '\xe9'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

You could work around this by explicitly encoding the string at print time and handling encoding errors by replacing unsupported characters with ?:

>>> print(s.encode('ascii', errors='replace'))
b'?'
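If you would rather keep printing text than a bytes literal, the same idea can be wrapped in a small helper. This is only a sketch: `safe_text` is a hypothetical name, not part of any library, and it uses the standard `backslashreplace` error handler so that unsupported characters stay identifiable instead of collapsing to ?.

```python
import sys


def safe_text(text, encoding=None, errors="backslashreplace"):
    """Return text with characters the target encoding cannot represent escaped.

    safe_text is a hypothetical helper for illustration only.
    """
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = encoding or sys.stdout.encoding or "ascii"
    # Encode with the chosen error handler, then decode back to str for print().
    return text.encode(encoding, errors=errors).decode(encoding)


# Made-up sample data shaped like the output of FreqDist.most_common().
vocab = [("the", 120), ("caf\u00e9", 3)]
print(safe_text(str(vocab), encoding="ascii"))
```

Passing `errors="replace"` instead gives the ? substitution shown above.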

The character '\xe9' is U+00E9, a small letter e with acute (é). It has the same value in ISO-8859-1.

You can check the encoding used for stdout. In my case it's UTF-8, and I have no problem printing that character:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print('\xe9')
é
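If stdout reports something other than UTF-8 and you cannot change the terminal settings, one option on Python 3.7 or later is to switch the stream's encoding from inside the script with `reconfigure`. The sketch below assumes the terminal can actually display UTF-8 bytes; `force_utf8` is a hypothetical helper name.

```python
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 when it supports reconfigure().

    force_utf8 is a hypothetical helper; TextIOWrapper.reconfigure()
    is available from Python 3.7 onwards.
    """
    if hasattr(stream, "reconfigure"):
        # backslashreplace is a safety net; UTF-8 can encode any valid character.
        stream.reconfigure(encoding="utf-8", errors="backslashreplace")
    return stream


force_utf8(sys.stdout)
print("\u00e9")
```

Streams that lack `reconfigure` (for example a replaced or captured stdout) are returned unchanged.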

You might be able to coerce Python into using a different default encoding, but the best way would be to use a terminal that supports UTF-8.
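One documented way to change the encoding Python uses for stdout without editing the script is the `PYTHONIOENCODING` environment variable. The sketch below runs a child interpreter under two settings to show the effect; `run_child` is a hypothetical helper, and the one-line child program stands in for preprocess.py.

```python
import os
import subprocess
import sys


def run_child(io_encoding):
    """Run a tiny Python program with PYTHONIOENCODING set to io_encoding."""
    # Copy the current environment and override only the I/O encoding.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.run(
        [sys.executable, "-c", "print('\\xe9')"],
        env=env,
        capture_output=True,
    )


ok = run_child("utf-8")
bad = run_child("ascii")
print("utf-8 exit code:", ok.returncode)
print("ascii exit code:", bad.returncode)
```

From a shell, the equivalent is to set `PYTHONIOENCODING=utf-8` before running the script.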

mhawke