
I'm writing a data processing script that should handle files of any encoding. When reading the files, I use:

    df = (pd.read_excel(filepath, dtype=str)
          if '.xlsx' in filepath
          else pd.read_csv(filepath, dtype=str, low_memory=False,
                           error_bad_lines=True, sep=CSV_delimiter))

(CSV_delimiter and filepath are defined variables)

While xlsx files work great, the handling of CSVs is less straightforward. For many files I keep receiving encoding errors that only resolve when I open the CSVs manually, save them with UTF-8-BOM encoding, and run the script again. I tried adding `encoding="utf_8_sig"`, but now I receive errors like this: `'utf-8' codec can't decode byte 0xff in position 0: invalid start byte`

Any recommendations?
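For context, one common workaround (not from the question itself) is to try a short list of candidate encodings in order and fall back to the next one on a decode error; the function name and candidate list below are my own illustration, not an established API:

```python
import pandas as pd

# Candidate encodings tried in order. "latin-1" maps every byte to a
# character, so it never raises and acts as a lossless last resort.
CANDIDATE_ENCODINGS = ["utf-8-sig", "utf-8", "cp1252", "latin-1"]

def read_csv_any_encoding(filepath, sep=","):
    """Try each candidate encoding until pandas can decode the file."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return pd.read_csv(filepath, dtype=str, low_memory=False,
                               sep=sep, encoding=enc)
        except UnicodeError:
            continue  # decode failed: try the next candidate
    raise ValueError(f"could not decode {filepath} with any candidate encoding")
```

Note the caveat raised in the comments below: a fallback that "succeeds" is not guaranteed to be the *correct* encoding, only one that decodes without errors.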

goidelg
  • *If* you saved the file with a utf-8 BOM character, the first byte of the file should be `0xef`, not `0xff` (and, of course, you will henceforth need to specify `encoding="utf_8_sig"` to open the file), so it's not clear, at least to me, what you are describing. And what do you possibly mean by **reading a csv as utf-8-sig from any encoding**? If you specify `utf-8-sig` as your encoding, that can only reliably handle ascii and utf-8 encoded files (with or without a BOM), and conversely, if it is a `utf-8` encoded file with a BOM, you *have* to specify `utf-8-sig` as the encoding to read it. – Booboo Mar 25 '21 at 19:32
    Unless you know the encoding of the CSV, you can't reliably open the file. There are libraries that guess the encoding, such as [`chardet`](https://pypi.org/project/chardet), but it isn't perfect. – Mark Tolonen Mar 25 '21 at 20:35
  • Thanks @Booboo. I'll clarify: I don't know what encoding the files will be in. I wish to convert them to utf-8. – goidelg Mar 25 '21 at 21:17
  • That's what it sounded like from your first sentence but then the rest of your question lost me. The comment by @MarkTolonen is "on point", but the problem is even worse than that. A file could be encoded, for example, in utf-32, and you are able to successfully decode using utf-16 (that is, without errors), but the result is not the original text. See [this post](https://stackoverflow.com/questions/66708624/determine-encoding-of-an-item-with-its-start-byte/66709276#66709276). – Booboo Mar 25 '21 at 22:26
  • Thanks Mark! What commands should I use to assess the format of an incoming file? e.g. "peek" at the first bytes; if it's one of the 5 most common encodings, case/switch to read accordingly, else raise an exception. – goidelg Mar 26 '21 at 02:51
  • Wouldn't know exactly without the csv, but it could be embedded quotes within quotes that Python may not know how to handle, or it could be Latin characters. – lunastarwarp Mar 25 '21 at 19:08
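The "peek at the first bytes" idea from the comments can only identify the handful of Unicode encodings that carry a BOM; files without one need a guesser such as `chardet` instead. A minimal sketch (the function name is my own; the BOM constants are from Python's standard `codecs` module):

```python
import codecs

# BOM signatures and the codec that decodes them while stripping the BOM.
# UTF-32-LE (ff fe 00 00) must be checked before UTF-16-LE (ff fe).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),     # ef bb bf
    (codecs.BOM_UTF16_LE, "utf-16"),    # ff fe — matches the 0xff in the error
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(filepath):
    """Return a codec name if the file starts with a known BOM, else None."""
    with open(filepath, "rb") as f:
        head = f.read(4)  # longest BOM is 4 bytes
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None  # no BOM: fall back to chardet or a default encoding
```

The `0xff` in the question's error message is consistent with a UTF-16/UTF-32 little-endian BOM, which `utf-8-sig` cannot decode.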

0 Answers