How to ignore invalid lines in a file?

Question

I'm iterating over a file

for line in io.TextIOWrapper(readFile, encoding = 'utf8'):

when the file contains the following line

b'"""\xea\x11"\t1664\t507\t137\t2\n'

that generates the following exception

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 3: invalid continuation byte

How can I make my script ignore such lines and continue with the good ones?

As a side note, why are you using `io.TextIOWrapper(readFile, …)` explicitly, instead of just `open`/`io.open`-ing the file in text mode in the first place? There are occasionally good reasons to do this, but I've seen people doing it for no good reason… — abarnert, Dec 17 '13 at 01:20
@abarnert the reason is explained here http://stackoverflow.com/questions/20601796/how-to-open-an-unicode-text-file-inside-a-zip/20603185?noredirect=1#20603185 — Jader Dias, Dec 17 '13 at 01:25
OK, cool. Meanwhile… why do you want to skip over lines like this? In an arbitrary file with embedded garbage, the chances that the garbage is neatly separated out by lines are pretty slim, so half the time you'll end up skipping some non-terminated garbage plus a real line of text. Also, a lot of things are valid UTF-8 but complete nonsense. If you know what the actual format is, it would be a lot better to parse it correctly than to use this heuristic. — abarnert, Dec 17 '13 at 01:29
The file is not garbage, it's http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-1M-1gram-20090715-8.csv.zip — Jader Dias, Dec 17 '13 at 01:40

score 7 · Answer 1 · answered Dec 17 '13 at 01:18

Pass the `errors='ignore'` argument to `TextIOWrapper`. Other error handlers, such as `'strict'` (the default) and `'replace'`, are described in the Python documentation for the `errors` parameter.
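
As a rough illustration of what the error handlers do, here is a minimal sketch that runs the line from the question through `TextIOWrapper`. The in-memory `io.BytesIO` buffer, the helper name `read_lines`, and the second "good" line are assumptions made up for the example; they are not part of the original script.

```python
import io

# The problematic line from the question, followed by a well-formed line
# (the second line is invented for illustration).
raw = b'"""\xea\x11"\t1664\t507\t137\t2\n' + b'good\t1\t2\t3\t4\n'

def read_lines(data, errors):
    """Decode `data` as UTF-8 through TextIOWrapper with the given error handler."""
    # BytesIO stands in for the binary file object (readFile in the question).
    with io.TextIOWrapper(io.BytesIO(data), encoding='utf8', errors=errors) as f:
        return list(f)

ignored = read_lines(raw, 'ignore')    # the invalid byte 0xea is silently dropped
replaced = read_lines(raw, 'replace')  # the invalid byte becomes U+FFFD

for line in ignored:
    print(repr(line))
```

Note that with either handler the damaged line is still returned, only with the bad byte removed or replaced, so this approach does not skip the line itself.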

kalhartt


The problem is that this skips over invalid characters silently, so it doesn't give you any way to ignore the whole line (or even know which one you want to ignore). – abarnert Dec 17 '13 at 01:21

score 7 · Accepted Answer · answered Dec 17 '13 at 01:22

If you want to skip an entire line whenever it contains invalid bytes, you need to know that decoding failed. That means you can't use TextIOWrapper; you have to decode each line manually instead:

for bline in readFile:
    try:
        line = bline.decode('utf-8')
    except UnicodeDecodeError:
        continue
    # do stuff with line

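Note, however, that this does not give you the newline handling of a text-mode file: iterating over a binary file splits only on \n and leaves any \r\n terminator in the decoded line. If you need universal-newline behavior, you have to implement it explicitly.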
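
One way to make the newline handling explicit is sketched below. The generator name `iter_valid_lines` and the decision to yield line numbers are assumptions for illustration, not part of the answer above. It only normalizes a trailing CRLF to LF; a lone CR is not treated as a line break, because binary iteration never splits on it.

```python
import io

def iter_valid_lines(binary_file, encoding='utf-8'):
    """Yield (line_number, text) for each line that decodes cleanly."""
    for lineno, bline in enumerate(binary_file, start=1):
        try:
            line = bline.decode(encoding)
        except UnicodeDecodeError:
            # Skip the whole line; lineno could be logged here instead.
            continue
        # Approximate text-mode newline translation: CRLF becomes LF.
        if line.endswith('\r\n'):
            line = line[:-2] + '\n'
        yield lineno, line

# Usage with an in-memory stand-in for the binary file object.
sample = io.BytesIO(
    b'ok\t1\r\n'
    b'"""\xea\x11"\t1664\t507\t137\t2\n'
    b'fine\t2\n'
)
for lineno, line in iter_valid_lines(sample):
    print(lineno, repr(line))
```

Yielding the line number also makes it possible to tell which lines were dropped, which a plain `errors='ignore'` cannot do.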

score 4 · Answer 3 · answered Dec 17 '13 at 01:19

I think you can pass the errors parameter:

io.TextIOWrapper(readFile, encoding='utf-8', errors='ignore')

aIKid

