I'm writing a data processing script that should handle files in any encoding. To read the files, I use:
df = (pd.read_excel(filepath, dtype=str) if '.xlsx' in filepath
      else pd.read_csv(filepath, dtype=str, low_memory=False,
                       error_bad_lines=True, sep=CSV_delimiter))
(CSV_delimiter and filepath are defined variables)
While xlsx files work fine, handling CSVs is less straightforward. For many files, I keep getting encoding errors that go away only when I open the CSV manually, save it with UTF-8-BOM encoding, and run the script again. I tried adding encoding="utf_8_sig", but now I get errors like this:
'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Any recommendations?
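
For context, a byte 0xff at position 0 is consistent with a UTF-16 little-endian byte order mark (FF FE), so some of these files may not be UTF-8 at all. One direction I'm considering is to look at the first bytes of each CSV and choose the encoding from the BOM before calling read_csv. The following is only a minimal sketch under that assumption: sniff_encoding and BOMS are hypothetical names of my own, and the cp1252 fallback is a guess about where the BOM-less files come from, not something I have confirmed.

```python
import codecs

# BOM signatures mapped to Python codec names. UTF-32 is checked before
# UTF-16 because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE
# BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(path, fallback="cp1252"):
    """Hypothetical helper: guess a file's encoding.

    Order of checks: a BOM if present, then strict UTF-8, then `fallback`
    (an assumption; any single-byte codec would decode without error).
    """
    # Reads the whole file into memory, which is fine for a sketch but may
    # need chunking for very large files.
    with open(path, "rb") as f:
        raw = f.read()

    for bom, name in BOMS:
        if raw.startswith(bom):
            return name

    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback
```

The result would then be passed along in the CSV branch, e.g. pd.read_csv(filepath, dtype=str, sep=CSV_delimiter, encoding=sniff_encoding(filepath)). The "utf-16" and "utf-32" codecs read the byte order from the BOM and strip it, and "utf-8-sig" strips the UTF-8 BOM, so the first column name should not pick up a stray marker character. Files with no BOM that are not valid UTF-8 cannot be identified reliably this way, so the fallback is only a best guess.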