I am trying to read a CSV file (from https://openwrt.org/_media/toh_dump_tab_separated.zip) in Python with pandas.read_csv(). The problem is the file's encoding: it is neither UTF-8 nor Latin1, and I don't want to try every codec by hand (https://docs.python.org/3/library/codecs.html#standard-encodings).
My current workaround is to open the file in LibreOffice, replace the odd characters with '-', save it as Latin1, and then open it in Python.
How do I do it in Python only?
Here is my current code and the error it raises with UTF-8:
import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding='utf-8')
(...)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 983: invalid start byte
and with Latin1:
import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding='Latin1')
(...)
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
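For reference, the LibreOffice workaround described above could probably be approximated in Python along these lines. This is only a sketch under assumptions: the helper name `read_tsv_lenient` and the sample bytes are made up for illustration, and it assumes the file really is tab-separated, as its name suggests. Note that the Latin1 attempt fails in the parser rather than in the decoder, which hints that the missing `sep` may be part of the problem and not only the encoding.

```python
import io

import pandas as pd


def read_tsv_lenient(raw: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Parse tab-separated bytes, substituting undecodable bytes instead of raising.

    Hypothetical helper, not part of pandas.
    """
    # errors="replace" turns each byte that is invalid in `encoding`
    # into U+FFFD (the Unicode replacement character).
    text = raw.decode(encoding, errors="replace")
    # Assumption: fields are separated by tabs, so sep is set explicitly;
    # the read_csv default is a comma.
    return pd.read_csv(io.StringIO(text), sep="\t")


# Made-up sample containing the same stray byte (0xbf) as in the traceback.
sample = b"brand\tmodel\nAcme\tX\xbf1\n"
df = read_tsv_lenient(sample)
print(df)
```

For the real file, the bytes could be read with `pathlib.Path(path).read_bytes()` and passed to the helper. Recent pandas versions (1.3 and later) also accept `encoding_errors="replace"` directly in `pd.read_csv`, which should make the manual decode step unnecessary.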