Trying to read an extremely large text list in python

Question

I'm trying to reduce a large list in size by removing irrelevant data. I'm currently using

with open("list.txt") as f_line:
    for line in f_line:
        Doing_things()  # placeholder for the per-line processing

This works with a smaller file, but when the larger main file is used it gives the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3656: ordinal not in range(128)

Is there another way to read the list into Python? The file has over 10000 individual data points. Thanks for your help.

The size of the file is not the problem. The problem is that you open the file as if it contains ASCII text, and it doesn't. — Matthias, Nov 28 '16 at 16:50
I don't think the problem is with the file size. I have used this method for files as large as 30 GB+ without a glitch. There might be a problem with the files themselves. — Ébe Isaac, Nov 28 '16 at 16:50
First off, "large" is a relative term. 10000 isn't a significant number. From what you've posted, though, it appears that it's not related to the file size or memory but rather a problem with encoding. You should convert the lines to UTF-8 first. — Alexander Ejbekov, Nov 28 '16 at 16:51
For the error, it might not be caused by the file size, maybe because of some strange character at that line. — ettanany, Nov 28 '16 at 16:51

score 1 · Accepted Answer · answered Nov 28 '16 at 17:04

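The cause is probably a mismatch in the file encoding. Your Python interpreter is decoding the text file as ASCII, but the file is actually encoded as something else, most likely UTF-8 or Latin-1. ASCII only covers byte values 0 to 127, so the byte 0xe2 in the error message cannot be ASCII. If the file contains accented characters or other non-ASCII symbols, it's certainly not an ASCII file.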
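To illustrate the mismatch, here is a minimal sketch. The sample bytes are made up for the example (they are not taken from your file); they are the UTF-8 encoding of a typographic apostrophe, which happens to start with 0xe2.

```python
# Hypothetical sample: "it’s" with a typographic apostrophe (U+2019),
# which UTF-8 encodes as the three bytes e2 80 99.
raw = b"it\xe2\x80\x99s"

# Decoding as ASCII fails on the first byte above 127.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decoding with the right codec recovers the original text.
utf8_text = raw.decode("utf-8")

# Latin-1 never raises, but with the wrong guess you get mojibake:
# each of the three bytes becomes its own character.
latin1_text = raw.decode("latin-1")

print(ascii_error)
print(utf8_text)
print(len(utf8_text), len(latin1_text))
```

Note that a successful decode does not prove the encoding is right: Latin-1 accepts any byte sequence, so the result has to be checked by eye.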

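Which version of Python do you use? Python 2 treats text differently than Python 3. In Python 3, open() decodes the file with a platform-dependent default encoding unless you pass one explicitly.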

I generally use Notepad++ to check which encoding is used in a text file if it's unclear.
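If you would rather check from Python, a rough sketch is to read the raw bytes and try a few candidate encodings in order. The function name `guess_encoding` and the candidate list are my own assumptions for illustration, and the result is only a guess: Latin-1 accepts any bytes, so it is placed last as a fallback.

```python
def guess_encoding(path, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: it reads the entire file into memory, which is
    fine for a file with about 10000 lines but not for very large ones.
    """
    with open(path, "rb") as f:  # binary mode: no decoding happens here
        data = f.read()
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes
        return name
    return None  # none of the candidates worked
```

For example, `guess_encoding("list.txt")` would return `"utf-8"` for a file that contains valid UTF-8 with at least one non-ASCII character.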

Once you know which encoding is used, you can specify it with the encoding parameter of open(), like this:

with open('list.txt', encoding='utf-8') as f_line:
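Put together with the loop from the question, a minimal sketch could look like the following. It assumes Python 3 and that the file really is UTF-8; the generator name `iter_lines` is hypothetical. The `errors` parameter of open() is optional: `"strict"` (the default) raises on bad bytes, while `"replace"` substitutes U+FFFD so the loop keeps going.

```python
def iter_lines(path, encoding="utf-8", errors="strict"):
    """Yield the lines of a text file one at a time, without the newline.

    Hypothetical helper: the file is read lazily, so memory use stays
    small regardless of file size.
    """
    with open(path, encoding=encoding, errors=errors) as f_line:
        for line in f_line:
            yield line.rstrip("\n")
```

Usage would then be `for line in iter_lines("list.txt"): ...`, with the per-line filtering inside the loop. Passing `errors="replace"` is a trade-off: it avoids the exception but silently alters undecodable data, so it is probably better to find the correct encoding first.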

Thanks, it's working now :) I was using Python 3 and I think it was getting confused trying to deal with emojis in the file. — jacob Bailey, Nov 29 '16 at 00:36
