NLTK can't open files (UnicodeDecodeError)

Question

I have a task that involves working with some files, and I need to use NLTK. I am working with the Harry Potter books and short stories by J. K. Rowling. Some files open fine, and I can count words, sentences, etc. But when I try to open big files, I get something like this: https://pp.vk.me/c623420/v623420264/2d8b5/xE66_z6JWUs.jpg

Please tell me what the problem is.

What is your Python and NLTK version? Might be related to this: http://stackoverflow.com/questions/25493720/python-nltk-unicodedecodeerror-ascii-codec-cant-decode-byte — bakkal, May 02 '15 at 13:45
I use: `with open('Harry_*.txt', 'r') as myfile: data = myfile.read().replace('\n', ' ')` After that I split the text into sentences and words. Maybe the problem is not the size of the file but something else — Katherine, May 02 '15 at 14:43

score 1 · Accepted Answer · answered May 02 '15 at 14:42

This is very likely a file encoding issue. Since I can't see your code or the file, I suggest you try specifying an encoding when you open the file, before passing its contents to NLTK:

import io
# ISO-8859-1 is a guess; use whichever encoding your file actually has
with io.open('harrypotter.txt', encoding='ISO-8859-1') as f:
    text = f.read()
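If you don't know the file's encoding, one option is to try a few likely candidates in order and keep the first that decodes without error. The following is only a rough sketch: the helper name `read_text` and the candidate list are my own assumptions, not part of NLTK or of your code. ISO-8859-1 goes last because it accepts any byte sequence, so it acts as a catch-all that may silently give wrong characters.

```python
import io


# Hypothetical helper: the name and the default candidate list are assumptions.
def read_text(path, encodings=("utf-8", "cp1252", "ISO-8859-1")):
    """Return (text, encoding) for the first candidate that decodes the file."""
    for enc in encodings:
        try:
            with io.open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            # This candidate cannot decode the bytes; try the next one.
            continue
    # Only reachable if the caller passes a list without a catch-all encoding.
    raise ValueError("none of the candidate encodings could decode " + path)
```

You would call it as `text, enc = read_text('harrypotter.txt')` and then hand `text` to NLTK, for example `nltk.word_tokenize(text)`. A clean decode does not prove the guess was right, so it is worth checking a few accented characters or curly quotes in the result.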
