
This is an issue that is quite common for me. So common that it seems I'm missing something (i.e., there has to be a better way). There are multiple posts on SO dedicated to this problem, but they seem to be workarounds rather than an actual solution. So, I have a CSV to read in:

import pandas as pd

df = pd.read_csv('filename.csv')

I get the following error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 9: invalid continuation byte
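
For what it's worth, byte 0xE9 is 'é' in Latin-1/Windows-1252, so files exported from Windows tools are a common source of this error. One brute-force sketch (the candidate list below is just a guess for my files, not anything authoritative) is to try a few likely encodings in order:

import pandas as pd

# Try a short list of candidate encodings until one decodes cleanly.
# latin-1 maps all 256 byte values, so it always succeeds and acts as a
# last-resort fallback (at the risk of mojibake for non-Latin-1 files).
for enc in ('utf-8', 'cp1252', 'latin-1'):
    try:
        df = pd.read_csv('filename.csv', encoding=enc)
        break
    except UnicodeDecodeError:
        continue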

My current workaround is to use cchardet or chardet (cchardet is much, much faster):

import cchardet as chardet

# Detect the encoding from the raw bytes, then re-read the file with pandas.
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())

df = pd.read_csv('filename.csv', encoding=result['encoding'])
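
detect() also reports a confidence score alongside the guessed encoding, so a slightly more defensive version of the above only trusts the guess when the detector is reasonably sure (the 0.7 cutoff is my own arbitrary choice, not a documented threshold):

# Fall back to UTF-8 when the detector isn't confident.
# The 0.7 cutoff is an arbitrary choice, not a documented threshold.
encoding = result['encoding'] if result['confidence'] > 0.7 else 'utf-8'
df = pd.read_csv('filename.csv', encoding=encoding)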

This works, but it doesn't seem like a long-term solution because even cchardet isn't guaranteed to find the correct encoding. Is there an option in pd.read_csv to guess the encoding? I'd rather it fail on a few characters than fail to read the entire file. As for frequency: I hit this problem nearly every time I read in a CSV, and it makes pandas feel a little fragile. Tell me I'm missing something.
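
A partial fit, assuming a pandas new enough to have it (read_csv grew an encoding_errors parameter in 1.3), is to tell the parser to substitute undecodable bytes instead of raising:

# Assumes pandas >= 1.3, which added the encoding_errors parameter.
# 'replace' swaps undecodable bytes for U+FFFD instead of raising, so a
# few bad characters don't block reading the entire file.
df = pd.read_csv('filename.csv', encoding='utf-8', encoding_errors='replace')

On older versions the same effect is possible by opening the file yourself with errors='replace' and handing the file object to read_csv:

with open('filename.csv', encoding='utf-8', errors='replace') as f:
    df = pd.read_csv(f)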

Ryan Erwin
  • Possible duplicate of [UnicodeDecodeError: 'utf-8' codec can't decode byte](https://stackoverflow.com/questions/19699367/unicodedecodeerror-utf-8-codec-cant-decode-byte) – DYZ Jan 15 '18 at 20:25
  • I don't see this as a duplicate. I can work around it; I'm wondering if there's a better way. I can't rely on the encoding to be stable, so I either need to 1) identify the encoding as I read each file, which is slow, or 2) find a more flexible solution. I'm hoping there's a feature in pandas that I'm overlooking. – Ryan Erwin Jan 15 '18 at 20:37
  • 3
    I think you'd be better off ensuring that the encoding is stable or explicit and just using that. Otherwise, you'll be stuck having to use tools like these. – juanpa.arrivillaga Jan 15 '18 at 20:38
  • I was afraid of that response, but thanks for the comment. – Ryan Erwin Jan 15 '18 at 21:04

0 Answers