Question
What is the best way to open a German csv file with pandas?
I have a German csv file with the following columns:
- Datum: Date in the format 'DD.MM.YYYY'
- Umlaute: German names with special characters specific to the German language
- Zahlen: Numbers in the format '000.000,00'
My expected output is:
Umlaute Zahlen
Datum
2020-01-01 Rüdiger 1000000.11
2020-01-02 Günther 12.34
2020-01-03 Jürgen 567.89
Sample data is provided below (see File).
1st attempt: Use pd.read_csv() without parameters
df = pd.read_csv('german_csv_test.csv')
This throws an UnicodeDecodeError
:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 12: invalid start byte
2nd attempt: Use pd.read_csv with specifying encoding and separation
df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')
This throws no error, but it is far from my desired output:
- The dates are strings not datetimes.
- The numbers aren't float, but objects.
- The column 'Datum' is not the index.
Datum Umlaute Zahlen
0 01.01.2020 Rüdiger 1.000.000,11
1 02.01.2020 Günther 12,34
2 03.01.2020 Jürgen 567,89
3rd attempt: Cleaning up
df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')
df['Datum'] = pd.to_datetime(df['Datum'])
df = df.set_index('Datum')
df['Zahlen'] = pd.to_numeric(df['Zahlen'])
Now, I have four lines of code and it still does not work. The last line throws an error ValueError: Unable to parse string " 1.000.000,11 " at position 0
. If I comment the last line out, it works. But the dates still are wrong, because day and month are switched.
Umlaute Zahlen
Datum
2020-01-01 Rüdiger 1.000.000,11
2020-02-01 Günther 12,34
2020-03-01 Jürgen 567,89
File
My file german_csv_test.csv
looks like this:
Datum;Umlaute;Zahlen
01.01.2020;Rüdiger; 1.000.000,11
02.01.2020;Günther; 12,34
03.01.2020;Jürgen; 567,89
It is encoded as 'cp1252'. I saved it on Windows with the option "CSV (MS-DOS)".