I already use the movie_reviews corpus for sentiment analysis. I replaced its text files with Arabic-language text files, but now I can't read or print them; I get an encoding error.

My code:

import nltk
from nltk.corpus import movie_reviews

documents = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append([movie_reviews.words(fileid), category])

print(documents[0])

Running it raises this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
m0nhawk
Karim
  • Possible duplicate of [Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)](http://stackoverflow.com/questions/40621799/python-unicodedecodeerror-ascii-codec-cant-decode-byte-0xef-in-position-0) – DYZ Apr 30 '17 at 22:57
  • I can solve the problem for a single text file by specifying its path and changing the encoding to UTF-8, but I couldn't do the same with the corpus. Could you give me any suggestions? – Karim Apr 30 '17 at 23:00
  • Is this an NLTK thing? Can you post the full stack trace? That looks like a Microsoft byte-order mark (BOM), which suggests that it's a problem at the point where a file is opened. – tdelaney Apr 30 '17 at 23:25
  • Yes, I import movie_reviews as a corpus from NLTK. – Karim Apr 30 '17 at 23:42
  • No answers yet :( – Karim May 01 '17 at 13:11
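
The byte-order-mark theory from the comments can be checked without NLTK. The following is only a minimal Python 3 sketch under stated assumptions: the sample sentence, the `load_documents` helper, and the `neg`/`pos` folder layout are hypothetical stand-ins for the replaced corpus files, and a plain whitespace split stands in for NLTK's tokenizer. It shows that a UTF-8 file starting with a BOM fails under ASCII at byte 0xef, position 0, and that the standard `utf-8-sig` codec reads the same bytes and drops the BOM.

```python
import codecs
import tempfile
from pathlib import Path

# Hypothetical sample: Arabic text saved as UTF-8 with a byte-order mark.
text = "فيلم رائع"
raw = codecs.BOM_UTF8 + text.encode("utf-8")

# The first BOM byte is 0xef, which the ASCII codec cannot decode.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# "utf-8-sig" decodes UTF-8 and strips a leading BOM if one is present.
decoded = raw.decode("utf-8-sig")


def load_documents(root, encoding="utf-8-sig"):
    """Return [words, category] pairs from <root>/<category>/*.txt.

    Hypothetical helper; whitespace splitting replaces real tokenization.
    """
    documents = []
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for path in sorted(category_dir.glob("*.txt")):
            words = path.read_text(encoding=encoding).split()
            documents.append([words, category_dir.name])
    return documents


# Build a tiny throwaway corpus laid out like movie_reviews (neg/ and pos/).
with tempfile.TemporaryDirectory() as root:
    for category in ("neg", "pos"):
        folder = Path(root, category)
        folder.mkdir()
        (folder / "sample.txt").write_bytes(raw)
    documents = load_documents(root)

print(ascii_error)
print(len(documents))
```

NLTK's `CategorizedPlaintextCorpusReader` also accepts an `encoding` argument, so pointing a reader at the Arabic files with `encoding='utf-8-sig'` is a plausible equivalent fix inside NLTK; the exact reader setup would depend on how the files are laid out.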

0 Answers