corpus utf-8 encoding using python

Asked Apr 30 '17 at 22:51

Active May 01 '17 at 06:41

Viewed 132 times

I already use the movie_reviews corpus for sentiment analysis. I replaced the existing text files with Arabic-language text files, but now I can't read or print them; I have an encoding problem.

My code:

import nltk
from nltk.corpus import movie_reviews

documents = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append([movie_reviews.words(fileid), category])

print(documents[0])

I have this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

edited May 01 '17 at 06:41

m0nhawk

asked Apr 30 '17 at 22:51

Karim

Possible duplicate of [Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)](http://stackoverflow.com/questions/40621799/python-unicodedecodeerror-ascii-codec-cant-decode-byte-0xef-in-position-0) – DYZ Apr 30 '17 at 22:57
I can solve the problem for a single text file by specifying its path and changing the encoding to UTF-8, but I couldn't do the same with the corpus. Could you give me suggestions? – Karim Apr 30 '17 at 23:00
This is an NLTK thing? Can you post the full stack trace? That looks like a Microsoft byte-order mark (BOM), which suggests that it's a problem where a file is opened. – tdelaney Apr 30 '17 at 23:25
Yes, I import movie_reviews as a corpus from nltk. – Karim Apr 30 '17 at 23:42
NO Answers :(((((( – Karim May 01 '17 at 13:11
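
A note on the BOM theory from the comments: the sketch below is not a confirmed fix, only a standard-library illustration of why a file that starts with a UTF-8 byte-order mark would produce this exact error when decoded as ASCII. The sample text and all variable names are hypothetical; nothing here touches the asker's actual files or NLTK.

```python
import codecs

# Hypothetical sample: an Arabic word ("marhaban"), written with escapes
# so the snippet itself stays ASCII-safe.
sample_text = "\u0645\u0631\u062d\u0628\u0627"

# Simulate a file saved as "UTF-8 with BOM": BOM bytes, then the UTF-8 text.
raw_bytes = codecs.BOM_UTF8 + sample_text.encode("utf-8")

# The first byte of the UTF-8 BOM is 0xef, the byte named in the traceback.
first_byte = raw_bytes[0:1]

# Decoding as ASCII fails on the very first byte (position 0).
try:
    raw_bytes.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if present.
decoded = raw_bytes.decode("utf-8-sig")

print(first_byte)
print(ascii_error)
print(decoded == sample_text)
```

If this is indeed the cause, one thing to try (an untested assumption for this setup) would be to load the replaced folder with NLTK's `CategorizedPlaintextCorpusReader` and pass an explicit `encoding` argument such as `'utf-8'`, rather than relying on the stock `movie_reviews` loader.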
