
I'm writing a data processing script that should handle files of any encoding. When reading the files, I use:

    df = (pd.read_excel(filepath, dtype=str)
          if '.xlsx' in filepath
          else pd.read_csv(filepath, dtype=str, low_memory=False,
                           error_bad_lines=True, sep=CSV_delimiter))

(CSV_delimiter and filepath are defined variables)

While xlsx files work great, the handling of CSVs is less straightforward. For many files I keep receiving encoding errors that only resolve when I open the CSVs manually, save them with UTF-8-BOM encoding, and run the script again. I tried adding `encoding="utf_8_sig"`, but now I receive errors like this: `'utf-8' codec can't decode byte 0xff in position 0: invalid start byte`

Any recommendations?
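For context, one common workaround (not from the question itself) is to try a short list of candidate encodings in order and fall back to the next one on a decode error; the function name and candidate list below are my own illustration, not an established API:

```python
import pandas as pd

# Candidate encodings tried in order. "latin-1" maps every byte to a
# character, so it never raises and acts as a lossless last resort.
CANDIDATE_ENCODINGS = ["utf-8-sig", "utf-8", "cp1252", "latin-1"]

def read_csv_any_encoding(filepath, sep=","):
    """Try each candidate encoding until pandas can decode the file."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return pd.read_csv(filepath, dtype=str, low_memory=False,
                               sep=sep, encoding=enc)
        except UnicodeError:
            continue  # decode failed: try the next candidate
    raise ValueError(f"could not decode {filepath} with any candidate encoding")
```

Note the caveat raised in the comments below: a fallback that "succeeds" is not guaranteed to be the *correct* encoding, only one that decodes without errors.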

goidelg
  • *If* you saved the file with a utf-8 BOM character, the first byte of the file should be `0xef`, not `0xff` (and, of course, you will henceforth need to specify `encoding="utf_8_sig"` to open the file), so it's not clear, at least to me, what you are describing. And what do you possibly mean by **reading a csv as utf-8-sig from any encoding**? If you specify `utf-8-sig` as your encoding, that can only reliably handle ascii and utf-8 encoded files (with or without a BOM), and conversely, if it is a `utf-8` encoded file with a BOM, you *have* to specify `utf-8-sig` as the encoding to read it. – Booboo Mar 25 '21 at 19:32
    Unless you know the encoding of the CSV, you can't reliably open the file. There are libraries that guess the encoding, such as [`chardet`](https://pypi.org/project/chardet), but it isn't perfect. – Mark Tolonen Mar 25 '21 at 20:35
  • Thanks @Booboo. I'll clarify: I don't know what encoding the files will be in. I wish to convert them to utf-8. – goidelg Mar 25 '21 at 21:17
  • That's what it sounded like from your first sentence but then the rest of your question lost me. The comment by @MarkTolonen is "on point", but the problem is even worse than that. A file could be encoded, for example, in utf-32, and you are able to successfully decode using utf-16 (that is, without errors), but the result is not the original text. See [this post](https://stackoverflow.com/questions/66708624/determine-encoding-of-an-item-with-its-start-byte/66709276#66709276). – Booboo Mar 25 '21 at 22:26
  • Thanks Mark! What commands should I use to assess the format of an incoming file? e.g. "peek" at the first bytes; if it's one of the 5 most common encodings, case/switch to read accordingly, else raise an exception. – goidelg Mar 26 '21 at 02:51
  • Wouldn't know exactly without the csv, but it could be embedded quotes within quotes that Python may not know how to handle, or it could be Latin characters. – lunastarwarp Mar 25 '21 at 19:08
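The "peek at the first bytes" idea from the comments can only identify the handful of Unicode encodings that carry a BOM; files without one need a guesser such as `chardet` instead. A minimal sketch (the function name is my own; the BOM constants are from Python's standard `codecs` module):

```python
import codecs

# BOM signatures and the codec that decodes them while stripping the BOM.
# UTF-32-LE (ff fe 00 00) must be checked before UTF-16-LE (ff fe).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),     # ef bb bf
    (codecs.BOM_UTF16_LE, "utf-16"),    # ff fe — matches the 0xff in the error
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(filepath):
    """Return a codec name if the file starts with a known BOM, else None."""
    with open(filepath, "rb") as f:
        head = f.read(4)  # longest BOM is 4 bytes
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None  # no BOM: fall back to chardet or a default encoding
```

The `0xff` in the question's error message is consistent with a UTF-16/UTF-32 little-endian BOM, which `utf-8-sig` cannot decode.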

0 Answers