I'm trying to reduce the size of a large list by removing irrelevant data. I'm currently using:

with open("list.txt") as f_line:
    for line in f_line:
        Doing_things()

It works with a smaller file, but when the larger main file is used it gives the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3656: ordinal not in range(128)

Is there another way to read the list into Python? The file has over 10,000 individual data points for the list. Thanks for your help.

  • The size of the file is not the problem. The problem is that you open the file as if it contains ASCII text, and it doesn't. – Matthias Nov 28 '16 at 16:50
  • I don't think the problem is with the file size. I have used this method for files as large as 30 GB+ without a glitch. There might be a problem with the files themselves. – Ébe Isaac Nov 28 '16 at 16:50
  • First off, "large" is a relative term. 10,000 isn't a significant number. From what you've posted, though, it appears that the problem is not related to file size or memory but to encoding. You should convert the lines to UTF-8 first. – Alexander Ejbekov Nov 28 '16 at 16:51
  • @ettanay: A file object is iterable. – Matthias Nov 28 '16 at 16:51
  • The error might not be caused by the file size; it may be due to some strange character on that line. – ettanany Nov 28 '16 at 16:51
  • The file might be corrupt. – Ébe Isaac Nov 28 '16 at 16:52

1 Answer

The cause is probably a 'misunderstanding' about the file encoding. Your Python interpreter is decoding the text file as ASCII, but it is actually encoded as something else, such as UTF-8 or Latin-1. If it contains accented characters, it is certainly not an ASCII file.
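To see why byte 0xe2 in particular trips the ASCII codec, here is a minimal sketch. It assumes the offending character is a right single quotation mark (U+2019), which is only a guess: many other characters also begin with 0xe2 when encoded as UTF-8. The sample string is made up for illustration.

```python
# Assumption: the sample text is invented; the real file's contents are unknown.
sample = "it\u2019s"              # contains U+2019, a typographic apostrophe
raw = sample.encode("utf-8")      # the bytes as they would sit in a UTF-8 file

# ASCII only covers byte values 0-127, so 0xe2 (226) cannot be decoded.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the encoding the bytes were written in succeeds.
decoded = raw.decode("utf-8")

print(raw)
print(ascii_error)
print(decoded == sample)
```

The same bytes are fine or broken depending only on which codec is asked to interpret them, which is why the smaller file (presumably pure ASCII) worked.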

Which version of Python are you using? Python 2 treats text differently from Python 3.

I generally use Notepad++ to check which encoding a text file uses if it's unclear.
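If no editor is at hand, a rough check can also be done from Python by trying a few candidate encodings in turn. This is only a sketch: `guess_encoding` and its candidate list are hypothetical names chosen for illustration, and a successful decode does not prove the guess is right, it only rules out encodings that fail.

```python
import os
import tempfile

def guess_encoding(path, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate that decodes the whole file, else None.

    Hypothetical helper: the candidate list is an assumption, not a standard.
    """
    with open(path, "rb") as f:   # read raw bytes, no decoding yet
        raw = f.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue              # this codec cannot represent the bytes
        return name
    return None

# Build a throwaway UTF-8 sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("caf\u00e9 \u2019quoted\u2019\n".encode("utf-8"))
    sample_path = tmp.name

detected = guess_encoding(sample_path)
os.remove(sample_path)
print(detected)
```

Stricter encodings are listed first on purpose: a permissive single-byte codec such as Latin-1 accepts any byte sequence, so it would always "succeed" if tried early.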

Once you know which encoding is used, you can specify it when opening the file, like this (in Python 2 the built-in open() does not accept an encoding argument; io.open() does):

with open('list.txt', encoding='utf-8') as f_line:
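Putting this together with the loop from the question, a minimal Python 3 sketch might look as follows. It assumes the file really is UTF-8; `read_lines` and the temporary sample file are hypothetical stand-ins for the real `list.txt` and the per-line processing.

```python
import os
import tempfile

def read_lines(path, encoding="utf-8"):
    """Yield each line of a text file, decoded with an explicit encoding."""
    with open(path, encoding=encoding) as f_line:
        for line in f_line:       # file objects are iterable, line by line
            yield line.rstrip("\n")

# Build a small sample file (stand-in for list.txt) so the sketch is runnable.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("plain\ncaf\u00e9 \u2019quoted\u2019\n")
    sample_path = tmp.name

lines = list(read_lines(sample_path))
os.remove(sample_path)
print(len(lines))
```

Because the file is still read one line at a time, this scales to large files just as the original loop did. If a handful of bytes remain undecodable, `open()` also accepts `errors="replace"`, at the cost of silently substituting those characters.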

Maarten Fabré
  • Thanks, it's working now :) I was using Python 3, and I think it was getting confused trying to deal with emojis in the file. – jacob Bailey Nov 29 '16 at 00:36