How to detect the right file encoding with python?

Question

I am trying to read a CSV file (from https://openwrt.org/_media/toh_dump_tab_separated.zip) in Python using pandas.read_csv(). The problem is the file's encoding: it is neither UTF-8 nor Latin1, and I don't want to go through all the codecs by hand (https://docs.python.org/3/library/codecs.html#standard-encodings).

My current workaround is to open the file in LibreOffice, replace the weird characters with '-', save it as Latin1, and then open it in Python.

How do I do it in Python only?

This is the code and the error I currently get with UTF-8:

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'utf-8')

(...)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 983: invalid start byte

and with Latin1:

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'Latin1')

(...)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

Encoding appears to be `cp1252`. — Mark Tolonen, Dec 16 '20 at 17:31

JosefZ · Answer 1 · 2020-12-17T17:11:23.413

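The file is tab-separated, but read_csv splits on commas by default, which is what the tokenizing error under Latin1 points to. Pass the sep parameter: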

import pandas as pd
# The dump is tab-separated and appears to be encoded as cp1252 (Windows-1252).
df = pd.read_csv('ToH_dump_tab_separated.csv', encoding='cp1252', sep='\t')
print(df)

          pid  ...                                           comments
0       16132  ...                                                NaN
1       16133  ...                                                NaN
2       16134  ...                                                NaN
3       16135  ...                           Clone of Aztech HW550-3G
4       16137  ...  Image build disabled in master with commit d7d...
...       ...  ...                                                ...
1759  9726386  ...                                                NaN
1760  9878711  ...  Rough edges as of December 2020. Realtek targe...
1761  9912125  ...  Works with WL-WN575A3 image according OpenWrt ...
1762  9927580  ...                                                NaN
1763  9946488  ...                                                NaN

[1764 rows x 67 columns]

FYI, the weird byte 0xbf is ¿ (Inverted Question Mark, U+00BF, or \u00BF as a Python escape):

print(df.switch[:2])
print(df.fccid[-2:])

0    Infineon ADM6996I
1                    ¿
Name: switch, dtype: object
1762                    http://¿
1763    https://fcc.io/Q87-03331
Name: fccid, dtype: object
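
As a small self-contained check of the above (a sketch that uses only the single byte from the error message, not the real file), the byte 0xbf can be decoded directly. UTF-8 rejects it as a start byte, while both cp1252 and latin1 map it to ¿:

```python
# The byte reported in the UnicodeDecodeError from the question.
raw = b"\xbf"

# UTF-8 rejects it: 0xbf is a continuation byte and cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Both single-byte codecs map 0xbf to U+00BF.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin1")

print(utf8_ok, as_cp1252, as_latin1)
```

So this byte alone cannot tell cp1252 and latin1 apart. The two codecs only differ for bytes in the range 0x80 to 0x9f.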

Edit (thanks to Mark Tolonen): the encoding appears to be cp1252. There are smart quotes in some of the fields:

print(df.comments[254][288:])

Ignore the “HW v” on the label - it may not say 2 for v2 hardware

Encoding appears to be `cp1252`. There are smart quotes in some of the fields. — Mark Tolonen, Dec 16 '20 at 17:38
Thanks for the help! But how do you know that it is cp1252, @MarkTolonen? Just by rolling the magic codec dice, or by looking closely at the characters and having a good knowledge of codecs? — Cyoux, Dec 17 '20 at 11:08
@Cyoux Option 2. I loaded the data, built a `set` of its characters, subtracted the `set` of ASCII characters, and was left with a few French accents and smart quotes when the file was opened as `cp1252`. `latin1` doesn't support smart quotes. 1252 is a common codec on US and Western European Windows. — Mark Tolonen, Dec 17 '20 at 16:18
@Cyoux read this thread: [What is the exact difference between Windows-1252(1/3/4) and ISO-8859-1?](https://stackoverflow.com/q/19109899/3439404) — JosefZ, Dec 17 '20 at 17:15
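
The check Mark Tolonen describes in the comments could look roughly like the sketch below. The helper name `non_ascii_chars` and the sample bytes are made up for illustration (the sample imitates the smart quotes and accents mentioned above and is not taken from the real file). With the real data you would pass the bytes obtained from `open(path, 'rb').read()`.

```python
def non_ascii_chars(data: bytes, encoding: str) -> set:
    """Decode data and return the characters that are not plain ASCII."""
    ascii_chars = {chr(i) for i in range(128)}
    return set(data.decode(encoding)) - ascii_chars

# Hypothetical sample: 0x93/0x94 are smart quotes in cp1252, 0xe9 is é.
sample = b"Ignore the \x93HW v\x94 on the label, caf\xe9"

for codec in ("cp1252", "latin1"):
    # repr() makes invisible control characters visible in the output.
    print(codec, sorted(repr(c) for c in non_ascii_chars(sample, codec)))
```

Under cp1252 the leftovers are readable characters (the two smart quotes and é). Under latin1 the bytes 0x93 and 0x94 become the invisible control characters U+0093 and U+0094, which hints that latin1 is the wrong guess. This is a manual plausibility check rather than automatic detection.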
