
I have a task that requires working with some files using NLTK. I am working with the Harry Potter books and short stories by J. K. Rowling. Some files open cleanly and I can count words, sentences, etc., but I have a problem: when I try to open big files, I get something like this: https://pp.vk.me/c623420/v623420264/2d8b5/xE66_z6JWUs.jpg

Could you please tell me what the problem is?

bakkal
Katherine
  • What is your Python and NLTK version? Might be related to this: http://stackoverflow.com/questions/25493720/python-nltk-unicodedecodeerror-ascii-codec-cant-decode-byte – bakkal May 02 '15 at 13:45
  • Python 2.7, NLTK 3.0. My OS is Linux, if that matters. – Katherine May 02 '15 at 14:24
  • I use: `with open('Harry_*.txt', 'r') as myfile: data = myfile.read().replace('\n', ' ')` After that I split the text into sentences and words. Maybe the problem is not the size of the file but something else. – Katherine May 02 '15 at 14:43

1 Answer


This is very likely a file encoding issue. Since I can't see your code or the file, I suggest you try specifying an encoding when you open the file, before passing its contents to NLTK:

import io
# io.open decodes the file to unicode text; replace ISO-8859-1 with your file's actual encoding
with io.open('harrypotter.txt', encoding='ISO-8859-1') as f:
    text = f.read()
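
If you don't know which encoding a given file uses, one option is to try a few candidates in order. The following is only a minimal sketch: the helper name `read_text` and the candidate list are my own assumptions, not anything from NLTK, and you should adjust the list to whatever your files actually use. ISO-8859-1 goes last because it accepts every byte and therefore never fails, even when it is the wrong guess.

```python
import io

def read_text(path, encodings=("utf-8", "ISO-8859-1")):
    """Return the file's contents as unicode text.

    Hypothetical helper: tries each candidate encoding in order and
    returns the result of the first one that decodes without error.
    """
    for enc in encodings:
        try:
            with io.open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # wrong guess; try the next candidate
    raise ValueError("none of the candidate encodings could decode %s" % path)
```

You could then do `data = read_text('harrypotter.txt').replace('\n', ' ')` and pass `data` to the tokenizer as before. Because `io.open` with an explicit encoding behaves the same way on Python 2.7 and Python 3, this should not depend on your Python version.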
bakkal