8

Is there any way to preprocess text files and skip these characters?

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa1 in position 1395: invalid start byte
Maximus S

2 Answers

27

Try this:

str.decode('utf-8', errors='ignore')
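A minimal Python 3 sketch of the same call, assuming the data is already held as `bytes` (in Python 3 `decode` lives on `bytes`, not `str`). The sample bytes are hypothetical; they only illustrate how `errors='ignore'` differs from `errors='replace'`:

```python
# Hypothetical sample: plain ASCII with one stray 0xa1 byte in the middle.
raw = b"hello \xa1 world"

# 'ignore' silently drops the byte that is not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' keeps a visible marker (U+FFFD) where the bad byte was.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Using `'replace'` instead of `'ignore'` at least leaves a trace of where data was lost, which may make it easier to spot a wrong encoding later.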
Irshad Bhat
  • While this is the workaround the OP actually asks for, it is nearly always the wrong solution. The correct approach would be to identify the correct encoding, and decode that instead. – tripleee Dec 13 '14 at 08:44
  • 3
    How do we read a file having invalid characters into str in the first place? – Nikhil VJ Apr 15 '18 at 05:25
  • 9
    @nikhilvj I know this is an old thread but for anyone else who landed here and realized that info is missing from the answer: open the file in binary mode with the `'rb'` flag: `content = open(filename, 'rb').read().decode('utf-8', errors='ignore')` – jez Jul 31 '18 at 15:27
  • 2
    It would also be good to mention here that errors='ignore' can be passed to open() directly, so if you are using `with open(...)` you don't have to read the file in binary mode first. – Joshua Nov 18 '18 at 21:03
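The text-mode variant mentioned in the comments could look roughly like this in Python 3. The helper name `read_text_lossy` and the temporary demo file are assumptions made for illustration only; `open()` itself accepts both `encoding` and `errors`:

```python
import os
import tempfile


def read_text_lossy(path, encoding="utf-8"):
    """Read a text file, dropping any bytes that are invalid in `encoding`."""
    # errors='ignore' applies to text-mode reads, so no binary detour is needed.
    with open(path, encoding=encoding, errors="ignore") as f:
        return f.read()


# Demo with a throwaway file that contains one invalid UTF-8 byte (0xa1).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc\xa1def\n")

demo_text = read_text_lossy(tmp.name)
os.remove(tmp.name)
print(repr(demo_text))
```

As the first comment points out, this discards data, so it is probably best kept for cases where the real encoding cannot be determined.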
4

I think your text file has some special characters that 'utf-8' can't decode.

Try using 'ISO-8859-1' instead of 'utf-8', like this:

   import sys
   # Python 2 only: setdefaultencoding is removed at startup; reload(sys) restores it.
   reload(sys).setdefaultencoding("ISO-8859-1")

   # put your code here
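In Python 3 there is no `reload` builtin and no `sys.setdefaultencoding`, so a rough equivalent is to name the codec explicitly at the point of decoding. This is only a sketch with hypothetical sample bytes, and it assumes the data really is ISO-8859-1, which the question does not confirm:

```python
# Hypothetical bytes containing 0xa1, the byte from the error message.
raw = b"precio \xa1hola!"

# ISO-8859-1 maps every possible byte to a code point, so this never raises.
# That also means it cannot tell you whether the guess was right.
text = raw.decode("ISO-8859-1")

print(text)
```

The same codec name can be passed as the `encoding` argument to `open()` when reading a file in text mode.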
Ve Pham
  • 1
    There is no indication that this is the correct codec. All we can know is that the input is in a different encoding than UTF-8; the rest is guesswork. – tripleee Dec 13 '14 at 08:42
  • Thank you! My file had a byte 0xc0 that was causing pandas to reject it. Hunted all around and this is the only place where an alternative encoding worked! – Nikhil VJ Apr 15 '18 at 05:27
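Since the right codec is guesswork, one hedged way to make the guess explicit is to try a short list of candidates strictly and report which one decoded cleanly. The function name `decode_first_match` and the candidate list are assumptions for illustration; a clean decode does not prove the encoding is correct, and ISO-8859-1 has to come last because it accepts any byte:

```python
def decode_first_match(raw, candidates=("utf-8", "cp1252", "ISO-8859-1")):
    """Return (text, encoding) for the first candidate that decodes strictly."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # not valid in this encoding; try the next one
    raise ValueError("none of the candidate encodings matched")


# Pure ASCII is valid UTF-8, so the first candidate wins.
print(decode_first_match(b"abc"))

# 0xa1 is an invalid UTF-8 start byte but is defined in cp1252.
print(decode_first_match(b"\xa1hola"))
```

Checking the decoded text by eye (for example, looking for accented letters that appear where expected) is still needed to confirm the guess.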