Problems reading a German CSV file in Python

Question

I have a German CSV file that I want to read with pd.read_csv.

Data:

The original file looks like this:

So it has two columns (A, B), and the separator should be ';'.

Problem: When I run the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep=';')

I get the error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

Half-solution: I understand this could have several causes, but when I run the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter')

I get the following dataset back:

    0
0   Etat;Die ARD-Tochter Degeto hat sich verpflich...
1   Etat;App sei nicht so angenommen worden wie ge...
2   Etat;'Zum Welttag der Suizidprävention ist es ...
3   Etat;Mitarbeiter überreichten Eigentümervertre...
4   Etat;Service: Jobwechsel in der Kommunikations...

So I only get one column instead of the two desired columns.

Target: Any idea how to load the dataset correctly, so that I get:

    0       1
0   Etat    Die ARD-Tochter Degeto hat sich verpflich...
1   Etat    App sei nicht so angenommen worden wie ge...

Hints/Tries:

When I search my data in Excel, I do not find any other ; in it.

It seems that some lines have more than two columns (as you can see, for example, in lines 3 and 13 of my example).

Is there a ; in one of your sentences that is misinterpreted as a delimiter? — Philip H., Aug 16 '19 at 09:09
I searched the dataset in Excel; there is no other ; in my dataset — PV8, Aug 16 '19 at 09:11
Maybe the following link already provides an answer: https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data — Philip H., Aug 16 '19 at 09:14
It appears in your Excel screenshot that the third line has been split into three columns, so it is also finding something it thinks is a delimiter there. Can you share the full text of the first three lines? — Simon Notley, Aug 16 '19 at 09:14
Simon, you are right: somehow I sometimes have 2 columns and sometimes 3. In line 13 I have the same problem — PV8, Aug 16 '19 at 09:20
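To follow up on the comments above, one way to check which lines really contain more than one ';' is to count the separators per line outside of Excel. This is only a minimal sketch: the function name find_lines_with_extra_delimiters and the sample text are made up for illustration, and it counts raw characters, so it ignores CSV quoting rules.

```python
import io


def find_lines_with_extra_delimiters(lines, sep=";", expected_fields=2):
    """Return (line_number, field_count) for every line with too many fields."""
    offenders = []
    for line_number, line in enumerate(lines, start=1):
        # Number of fields = number of separators + 1 (quoting is ignored here)
        field_count = line.rstrip("\n").count(sep) + 1
        if field_count > expected_fields:
            offenders.append((line_number, field_count))
    return offenders


# Made-up sample standing in for the real articles.csv
sample = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Erster Teil; zweiter Teil\n"
)

offenders = find_lines_with_extra_delimiters(sample)
print(offenders)
```

For the real file, the same function could be called on a file object opened with open(path, encoding='utf-8') instead of the in-memory sample.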

Ankur Sinha · Answer 1 · 2019-08-16T09:26:52.690

Skim through your texts carefully. If you find no leads, follow the solution below.

Note: This is not a perfect solution but a hack. It has worked for me multiple times when working with German text, since I found no other solution.

I just read the whole thing as a single column and split each string into the two desired columns at the first occurrence of the delimiter.

# n=1 splits only at the first ';' (n must be passed as a keyword in pandas 2.0+)
df['col1'] = df[0].str.split(';', n=1).str[0]
df['col2'] = df[0].str.split(';', n=1).str[1]

Output:

                            0    col1                   col2
0        Etat;Die ARD-Tochter..  Etat        Die ARD-Tochter
1         Etat;App sei nicht...  Etat          App sei nicht 
2  Etat;Mitarbeiter überreich..  Etat  Mitarbeiter überreich

I just trimmed the texts to demonstrate the example.
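The same "split at the first delimiter" idea can also be sketched without read_csv at all, by reading the lines in plain Python and using str.partition. This is a swapped-in variant, not the code from the answer above; the sample text is made up, and the column names col1 and col2 are reused from the answer only for consistency.

```python
import io

import pandas as pd

# Made-up sample standing in for the real file; the last line has an extra ';'
sample = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Erster Teil; zweiter Teil\n"
)

rows = []
for line in sample:
    line = line.rstrip("\n")
    if not line:
        continue  # skip empty lines
    # partition splits at the FIRST ';' only; later ';' stay in the text
    category, _, text = line.partition(";")
    rows.append((category, text))

df_partition = pd.DataFrame(rows, columns=["col1", "col2"])
print(df_partition)
```

Because the file is never parsed as CSV here, no placeholder separator is needed, but quoting rules are not applied either.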

jezrael · Accepted Answer · 2019-08-16T09:24:59.550

One possible solution is to read the file into a one-column DataFrame by using a separator that does not occur in the data, such as delimiter, and then use Series.str.split with the n parameter and expand=True to build the new DataFrame:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                       encoding='utf-8', header=None, sep='delimiter')

# a more general solution is to use a character that does NOT occur in the data, such as the yen sign ¥
#dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
#                      encoding='utf-8', header=None, sep='¥')

df = dataset[0].str.split(';', n=1, expand=True)
df.columns = ['A','B']
print(df)
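For anyone who wants to try this without the original file, here is a self-contained sketch of the same approach on made-up in-memory data. The sample text is an assumption, not the real articles.csv. Passing engine='python' explicitly is my addition: a multi-character separator is treated as a regular expression, which the C engine does not support, so pandas would otherwise fall back to the Python engine with a warning.

```python
import io

import pandas as pd

# Made-up sample standing in for the real file; the last line has an extra ';'
raw = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Erster Teil; zweiter Teil\n"
)

# 'delimiter' never occurs in the text, so every line ends up in one column
one_column = pd.read_csv(raw, header=None, sep="delimiter", engine="python")

# Split only at the first ';' and expand the two parts into separate columns
df = one_column[0].str.split(";", n=1, expand=True)
df.columns = ["A", "B"]
print(df)
```

The third row keeps its second ';' inside column B instead of producing a third column.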

score 1 · Answer 3 · answered Aug 16 '19 at 09:28


This works for me:

import pandas as pd
df = pd.read_csv('german.txt', sep=';', header = None, encoding='iso-8859-1')
df

Output:

       0    1
0   Etat    Die ARD-Tochter Degeto hat sich verpflich...
1   Etat    App sei nicht so angenommen worden wie ge...
2   Etat    'Zum Welttag der Suizidprävention ist es ...
3   Etat    Mitarbeiter überreichten Eigentümervertre...
4   Etat    Service: Jobwechsel in der Kommunikations...

answered Aug 16 '19 at 09:28 by M-M

Sorry, this does not work, because some lines in my original data have 3 columns, not just 2. – PV8 Aug 16 '19 at 09:37
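If the lines with three fields are the only obstacle, another option might be to keep sep=';' and repair the over-long lines while parsing. The sketch below assumes pandas 1.4 or newer, where on_bad_lines accepts a callable when engine='python'. The helper merge_extra_fields and the sample text are hypothetical, and the approach relies on the first line having the expected two fields, since that line determines the column count.

```python
import io

import pandas as pd


def merge_extra_fields(fields):
    """Keep the first field and glue all remaining fields back together."""
    return [fields[0], ";".join(fields[1:])]


# Made-up sample standing in for the real file; the last line has an extra ';'
raw = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Erster Teil; zweiter Teil\n"
)

df_merged = pd.read_csv(
    raw,
    sep=";",
    header=None,
    engine="python",                  # callables are only supported by the Python engine
    on_bad_lines=merge_extra_fields,  # called for each line with too many fields
)
print(df_merged)
```

Rejoining with ';' restores the original text only when no quoting was involved, so this is best treated as a workaround rather than a general fix.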
