I have a German CSV file that I want to read with pd.read_csv.

Data:

The original file looks like this:

[Screenshot of the file opened in Excel]

So it has two columns (A, B), and the separator should be ';'.

Problem: When I run the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep=';')

I get the error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

Half-Solution: I understand this could have several causes, but when I run the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter')

I get the following dataset back:

    0
0   Etat;Die ARD-Tochter Degeto hat sich verpflich...
1   Etat;App sei nicht so angenommen worden wie ge...
2   Etat;'Zum Welttag der Suizidprävention ist es ...
3   Etat;Mitarbeiter überreichten Eigentümervertre...
4   Etat;Service: Jobwechsel in der Kommunikations...

So I only get one column instead of the two desired columns.

Target: Any idea how to load the dataset correctly so that I get:

    0       1
0   Etat    Die ARD-Tochter Degeto hat sich verpflich...
1   Etat    App sei nicht so angenommen worden wie ge...

Hints/Tries:

When I use the search function on my data in Excel, I do not find any ; in it either.

It seems that some lines have more than two columns (as you can see, for example, in lines 3 and 13 of my example).
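One way to check this hint without Excel is to count the separators in the raw text. The following is only a minimal sketch: the sample lines and the helper name `lines_with_extra_separators` are made up for illustration, and the count is naive (it ignores any quoting rules).

```python
import io

# Hypothetical sample standing in for articles.csv; the third line holds an extra ';'.
sample = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Zum Welttag; der Suizidprävention\n"
)


def lines_with_extra_separators(handle, sep=";", expected_fields=2):
    """Return (line_number, field_count) for every line with too many fields."""
    offenders = []
    for number, line in enumerate(handle, start=1):
        # Naive count: every separator starts a new field, quoting is ignored.
        field_count = line.rstrip("\r\n").count(sep) + 1
        if field_count > expected_fields:
            offenders.append((number, field_count))
    return offenders


offenders = lines_with_extra_separators(sample)
print(offenders)
```

For the real file, the same helper could be fed a handle opened with open(path, encoding='utf-8') instead of the in-memory sample.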

PV8
  • Is there a ; in one of your sentences that is misinterpreted as a delimiter? – Philip H. Aug 16 '19 at 09:09
  • I searched the dataset in Excel; there is no other ; in it. – PV8 Aug 16 '19 at 09:11
  • Please provide a [mcve]. – hoefling Aug 16 '19 at 09:11
  • Maybe the following link already provides an answer: https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data – Philip H. Aug 16 '19 at 09:14
  • It appears in your Excel screenshot that the third line has been split into three columns, so it is also finding something it thinks is a delimiter there. Can you share the full text of the first three lines? – Simon Notley Aug 16 '19 at 09:14
  • Simon, you are right: somehow I sometimes have two columns and sometimes three. I have the same problem in line 13. – PV8 Aug 16 '19 at 09:20

3 Answers

Skim through your texts carefully. If you find no leads, try the solution below.

Note: This is not a perfect solution but a hack. It has worked for me several times with German text when I found no other solution.

I read each line as a single string and then split it into the two desired columns at the first occurrence of the delimiter.

# n=1 splits at the first ';' only; recent pandas versions require it as a keyword
df['col1'] = df[0].str.split(';', n=1).str[0]
df['col2'] = df[0].str.split(';', n=1).str[1]

Output:

                            0    col1                   col2
0        Etat;Die ARD-Tochter..  Etat        Die ARD-Tochter
1         Etat;App sei nicht...  Etat          App sei nicht 
2  Etat;Mitarbeiter überreich..  Etat  Mitarbeiter überreich

I trimmed the texts to keep the example short.
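As a variation on the same idea, pandas also offers Series.str.partition, which always cuts at the first occurrence of the separator. This is a small sketch on made-up rows, not the original data; the one-column frame is assumed to look like the one in the question.

```python
import pandas as pd

# Hypothetical one-column frame, as produced by reading each line whole.
df = pd.DataFrame(
    {0: ["Etat;Die ARD-Tochter", "Etat;App sei; nicht", "Etat;Mitarbeiter"]}
)

# partition cuts at the first ';' only and returns three columns:
# 0 = text before the separator, 1 = the separator itself, 2 = everything after it.
parts = df[0].str.partition(";")
df["col1"] = parts[0]
df["col2"] = parts[2]

print(df[["col1", "col2"]])
```

Any further ; in a line stays inside col2, so rows with a stray separator no longer produce a third column.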

Ankur Sinha

One possible solution is to create a one-column DataFrame by using a separator that does not occur in the data, such as the string 'delimiter', and then to use Series.str.split with the n parameter and expand=True to build the new DataFrame:

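import pandas as pd

# A multi-character separator is treated as a regex and needs the Python engine
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter',
                      engine='python')

# A more general option is a character that does NOT exist in the data, like the yen sign ¥
#dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
#                      encoding='utf-8', header=None, sep='¥')

# Split at the first ';' only and expand the two parts into columns
df = dataset[0].str.split(';', n=1, expand=True)
df.columns = ['A', 'B']
print(df)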
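If choosing a separator that never occurs in the data feels fragile, the lines can also be read with plain Python and split at the first ; before they reach pandas. This is a different technique from the one above, shown only as a sketch; the helper name `read_two_columns` and the sample text are assumptions.

```python
import io


def read_two_columns(handle, sep=";"):
    """Split every non-empty line at the first separator into [category, text]."""
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # str.partition cuts at the first separator only
        category, _, text = line.partition(sep)
        rows.append([category, text])
    return rows


# Hypothetical sample; the second line contains an extra ';' inside the text.
sample = io.StringIO(
    "Etat;Die ARD-Tochter Degeto\n"
    "Etat;Service: Jobwechsel; Kommunikation\n"
)
rows = read_two_columns(sample)
print(rows)
```

The resulting list can then be passed to pd.DataFrame(rows, columns=['A', 'B']). For the real file, the handle would come from open(path, encoding='utf-8').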
jezrael

This works for me:

import pandas as pd
df = pd.read_csv('german.txt', sep=';', header=None, encoding='iso-8859-1')
df

Output:

       0    1
0   Etat    Die ARD-Tochter Degeto hat sich verpflich...
1   Etat    App sei nicht so angenommen worden wie ge...
2   Etat    'Zum Welttag der Suizidprävention ist es ...
3   Etat    Mitarbeiter überreichten Eigentümervertre...
4   Etat    Service: Jobwechsel in der Kommunikations...
M-M
  • Sorry, this does not work, because some lines in my original file have 3 columns and not only 2. – PV8 Aug 16 '19 at 09:37
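For rows that split into three fields, as described in the comment above, read_csv can also repair the line while parsing. The sketch below assumes pandas 1.4 or later, where on_bad_lines accepts a callable together with engine='python'. The sample text and the helper name `merge_extra_fields` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample; the third line splits into three fields on ';'.
sample = io.StringIO(
    "Etat;Die ARD-Tochter Degeto\n"
    "Etat;App sei nicht so angenommen\n"
    "Etat;Zum Welttag; der Suizidprävention\n"
)


def merge_extra_fields(fields):
    """Keep the first field and glue all remaining fields back together."""
    return [fields[0], ";".join(fields[1:])]


# A callable for on_bad_lines is only supported by the Python engine
# (assumption: pandas >= 1.4). It is called for rows with too many fields.
df = pd.read_csv(
    sample,
    sep=";",
    header=None,
    names=["A", "B"],
    engine="python",
    on_bad_lines=merge_extra_fields,
)
print(df)
```

Rows with exactly two fields pass through unchanged; only the over-long rows go through the callable.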