This is an issue that is quite common for me. So common that it seems I'm missing something (i.e., there has to be a better way). There are multiple posts on SO dedicated to this problem, but they seem to be workarounds rather than an actual solution. So, I have a CSV to read in:
import pandas as pd

df = pd.read_csv("filename.csv")
and I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 9: invalid continuation byte
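For context, my current brute-force workaround (before reaching for a detector) is to loop over a few candidate encodings until one parses. This is just a sketch; the candidate list is arbitrary, and note that latin-1 never raises a UnicodeDecodeError (it maps every byte), so it acts as a silent catch-all that can produce mojibake:

```python
import pandas as pd

def read_csv_any(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in turn; return the first DataFrame that parses."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue  # wrong guess; try the next candidate
    # Only reachable if no candidate decodes (e.g. latin-1 is not in the list).
    raise ValueError(f"none of {encodings} could decode {path}")
```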
The solution is to use cchardet or chardet (cchardet is much, much faster):
import cchardet as chardet

with open("filename.csv", "rb") as f:
    result = chardet.detect(f.read())

df = pd.read_csv("filename.csv", encoding=result["encoding"])
This works, but it doesn't seem like a long-term solution because even cchardet isn't guaranteed to detect the correct encoding. Is there an option in pd.read_csv to guess the encoding? I'd rather it fail on a few characters than fail to read the entire file. As for frequency: nearly every time I read in a CSV I hit this problem, and it makes pandas feel a little fragile. Tell me I'm missing something.
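Related to the "fail on a few characters" preference: if I'm reading the docs right, pandas ≥ 1.3 added an encoding_errors argument to pd.read_csv, which at least mangles only the bad cells instead of refusing the whole file. A minimal sketch (the file name and contents here are made up to reproduce the error above):

```python
import pandas as pd

# Write a tiny CSV containing a byte that is invalid UTF-8 (0xE9),
# mimicking the kind of file that triggers the UnicodeDecodeError above.
with open("bad_bytes.csv", "wb") as f:
    f.write(b"name\ncaf\xe9\n")

# pandas >= 1.3: encoding_errors is forwarded to the decoder, so
# "replace" substitutes U+FFFD for undecodable bytes instead of raising.
df = pd.read_csv("bad_bytes.csv", encoding="utf-8", encoding_errors="replace")
print(df["name"].iloc[0])  # the bad byte shows up as the replacement character
```

That still doesn't guess the encoding for me, but it's closer to "degrade gracefully" than the all-or-nothing default.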