UnicodeDecodeError: how to skip invalid characters

Question

Is there any way to preprocess text files and skip these characters?

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa1 in position 1395: invalid start byte

Sure there is, but do you want to? Wouldn't it be better to use the proper codec for decoding in the first place so it all comes back intact? — Mark Ransom, Dec 12 '14 at 23:54
@MarkRansom I guess the better question is then, how do you find the proper codec given a text file? — Maximus S, Dec 13 '14 at 22:18
@MaximusS since you never told us where the data came from, we can't answer that question. Maybe you can? — Mark Ransom, Dec 13 '14 at 22:30
what's the correct encoding for 0xc0? Nothing in python seems to be able to read this. — Nikhil VJ, Apr 15 '18 at 05:16

score 27 · Accepted Answer · answered Dec 13 '14 at 00:00

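Try decoding with errors='ignore', which drops any bytes that are not valid UTF-8 (here `data` is the byte string you read):

data.decode('utf-8', errors='ignore')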
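
As a minimal sketch of what errors='ignore' does (assuming Python 3, where decode() is a method of bytes; the sample bytes and the helper name decode_lossy are made up for illustration), the following compares it with errors='replace', which leaves a visible marker where data was lost:

```python
# Hypothetical sample: valid UTF-8 ("café ") followed by a stray 0xa1 byte,
# the same byte value reported in the question's error message.
raw = b"caf\xc3\xa9 \xa1hola!"


def decode_lossy(data, errors="ignore"):
    """Decode bytes as UTF-8, dropping or replacing undecodable bytes."""
    return data.decode("utf-8", errors=errors)


# A strict decode (the default) raises UnicodeDecodeError on the stray byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

skipped = decode_lossy(raw)             # stray byte is silently dropped
marked = decode_lossy(raw, "replace")   # stray byte becomes U+FFFD

print(strict_failed)
print(repr(skipped))
print(repr(marked))
```

Note that 'ignore' discards data without any trace, so 'replace' may be the safer choice when you need to see where the input was damaged.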

Irshad Bhat

While this is the workaround the OP actually asks for, it is nearly always the wrong solution. The correct approach would be to identify the correct encoding and decode with that instead. – tripleee Dec 13 '14 at 08:44
How do we read a file having invalid characters into str in the first place? – Nikhil VJ Apr 15 '18 at 05:25
@nikhilvj I know this is an old thread but for anyone else who landed here and realized that info is missing from the answer: open the file in binary mode with the `'rb'` flag: `content = open(filename, 'rb').read().decode('utf-8', errors='ignore')` – jez Jul 31 '18 at 15:27
It would also be worth noting here that errors='ignore' can be passed to open() as well, so if you are using `with open(...)` you don't have to go through binary mode. – Joshua Nov 18 '18 at 21:03
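
The two approaches from the comments above can be sketched as follows, assuming Python 3. The helper names and the temporary sample file are hypothetical; only open() with its encoding and errors parameters and bytes.decode() come from the standard library.

```python
import os
import tempfile


def read_text_lossy(path, encoding="utf-8"):
    """Open in text mode and let open() skip bytes invalid in `encoding`."""
    with open(path, encoding=encoding, errors="ignore") as f:
        return f.read()


def read_bytes_then_decode(path, encoding="utf-8"):
    """Open in binary mode, then decode while skipping invalid bytes."""
    with open(path, "rb") as f:
        return f.read().decode(encoding, errors="ignore")


# Write a small sample file containing one byte (0xa1) that is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"ok \xa1 then\n")
    sample_path = tmp.name

try:
    via_text_mode = read_text_lossy(sample_path)
    via_binary_mode = read_bytes_then_decode(sample_path)
finally:
    os.remove(sample_path)

print(repr(via_text_mode))
print(repr(via_binary_mode))
```

Both helpers return the same text for this sample; the text-mode version simply avoids the intermediate bytes object.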

score 4 · Answer 2 · answered Dec 13 '14 at 07:20

Your text file probably contains bytes that are not valid UTF-8, so the 'utf-8' codec can't decode it.

Try using 'ISO-8859-1' instead of 'utf-8', like this:

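   # Python 2 only: reload(sys) restores setdefaultencoding(), which is
   # removed at startup; the function does not exist in Python 3.
   import sys
   reload(sys).setdefaultencoding("ISO-8859-1")

   # put your code here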

Ve Pham

There is no indication that this is the correct codec. All we can know is that the input is in a different encoding than UTF-8; the rest is guesswork. – tripleee Dec 13 '14 at 08:42
Thank you! My file had a byte 0xc0 that was causing pandas to reject it. Hunted all around and this is the only place where an alternative encoding worked! – Nikhil VJ Apr 15 '18 at 05:27
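
To illustrate why a decode that succeeds with ISO-8859-1 is not evidence that it is the right codec: ISO-8859-1 assigns a character to every one of the 256 byte values, so it never raises. A small sketch, assuming Python 3 and made-up sample bytes, compares it with Windows-1252 (cp1252), which maps the 0x80 to 0x9f range differently:

```python
# Hypothetical sample: 0xc0 and 0xa1 (the bytes mentioned in this thread)
# plus 0x93/0x94, which are curly quotes in cp1252 but control
# characters in ISO-8859-1.
raw = b"\xc0\xa1\x93quoted\x94"

# UTF-8 rejects these bytes outright.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Both single-byte codecs accept the data, but produce different text.
as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 decodes every possible byte value without error.
all_bytes_text = bytes(range(256)).decode("iso-8859-1")

print(utf8_failed)
print(repr(as_latin1))
print(repr(as_cp1252))
print(len(all_bytes_text))
```

Since both decodes succeed yet disagree, the absence of an error only shows that the bytes are representable in that codec, not that the result is what the author of the file intended.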