
I have multiple dataframes with partly different columns, and I want to build result_df by concatenating them. Any suggestions?

I am using the following code, but I am getting this error: Error tokenizing data. C error: Expected 1 fields in line 4, saw 5

list_ = []

if os.path.exists(csvfile):
    df = pd.read_csv(csvfile, sep=',', encoding='utf-8')
    list_.append(pd.concat(df))

frame = pd.concat(list_, ignore_index=True)



df1 =

   Apple  Banana
0      1       7
1      2      10
2      4       5
3      5       1
4      7       5

df2 =

   Apple  Banana  Carrot
0      1       7       5
1      2      10       8
2      4       5       8
3      5       1       2
4      7       5       1

df3 =

   Apple  Carrot  Mango
0      1       5      2
1      2       8      3
2      4       8      7
3      5       2      1
4      7       1      5


result_df =

    Apple  Banana  Carrot  Mango
0       1       7     n.a    n.a
1       2      10     n.a    n.a
2       4       5     n.a    n.a
3       5       1     n.a    n.a
4       7       5     n.a    n.a
5       1       7       5    n.a
6       2      10       8    n.a
7       4       5       8    n.a
8       5       1       2    n.a
9       7       5       1    n.a
10      1     n.a       5      2
11      2     n.a       8      3
12      4     n.a       8      7
13      5     n.a       2      1
14      7     n.a       1      5
K.Dᴀᴠɪs
Sun

1 Answer


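I believe you need error_bad_lines=False, and the first concat is not necessary, because pd.concat expects a list of DataFrames rather than a single DataFrame. Note that error_bad_lines is deprecated since pandas 1.3 in favor of on_bad_lines='skip' and was removed in pandas 2.0.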

import os
import pandas as pd

list_ = []

# csvfile is the path from the question's code
if os.path.exists(csvfile):
    list_.append(pd.read_csv(csvfile, sep=',', encoding='utf-8', error_bad_lines=False))

frame = pd.concat(list_, ignore_index=True)

Another solution, reading every CSV file in a folder:

import glob
import pandas as pd

files = glob.glob('files/*.csv')
list_ = [pd.read_csv(fp, encoding='utf-8', error_bad_lines=False) for fp in files]
df = pd.concat(list_, ignore_index=True)
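
Once the files load, pd.concat matches columns by name, so frames with different columns can be concatenated directly and the missing cells are filled with NaN. A minimal sketch, assuming the three frames from the question are built by hand here instead of being read from CSV files:

```python
import pandas as pd

# The question's sample frames, rebuilt by hand for illustration
df1 = pd.DataFrame({'Apple': [1, 2, 4, 5, 7], 'Banana': [7, 10, 5, 1, 5]})
df2 = pd.DataFrame({'Apple': [1, 2, 4, 5, 7], 'Banana': [7, 10, 5, 1, 5],
                    'Carrot': [5, 8, 8, 2, 1]})
df3 = pd.DataFrame({'Apple': [1, 2, 4, 5, 7], 'Carrot': [5, 8, 8, 2, 1],
                    'Mango': [2, 3, 7, 1, 5]})

# Columns are aligned by name; cells with no source value become NaN
result_df = pd.concat([df1, df2, df3], ignore_index=True)
print(result_df)
```

The result has 15 rows and the four columns Apple, Banana, Carrot and Mango, which corresponds to the n.a cells in the desired result_df.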

EDIT: To find the problematic rows, you can compare the length of each row with the length of the header:

import csv

with open('a.csv', newline='') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    # The header row defines the expected number of fields
    len_header = len(next(reader))
    for row in reader:
        if len(row) != len_header:
            print("Length of row is: %r" % len(row))
            print(row)

Length of row is: 3
['4', '5', '20']
Length of row is: 4
['5', '1', '5', '7']
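
The same check can be wrapped in a small helper that also reports the line number of each bad row, which makes the rows easier to locate in the file. This is only a sketch: the function name find_bad_rows is made up for illustration, and the sample text is inlined with io.StringIO instead of being read from a.csv.

```python
import csv
import io

def find_bad_rows(lines, delimiter=','):
    """Return (line_number, row) pairs for rows whose length differs from the header."""
    reader = csv.reader(lines, delimiter=delimiter)
    expected = len(next(reader))  # header length is the reference
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num counts lines read from the source, starting at 1
            bad.append((reader.line_num, row))
    return bad

# Same rows as a.csv, inlined so the sketch is self-contained
sample = io.StringIO("Apple,Banana\n1,7\n2,10\n4,5,20\n5,1,5,7\n7,5\n")
bad_rows = find_bad_rows(sample)
print(bad_rows)
```

Blank lines are also reported by this helper, because csv.reader returns them as empty rows.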

df = pd.read_csv('a.csv', warn_bad_lines=True, error_bad_lines=False)
print(df)
   Apple  Banana 
0      1        7
1      2       10
2      7        5
b'Skipping line 4: expected 2 fields, saw 3\nSkipping line 5: expected 2 fields, saw 4\n'

a.csv:

Apple,Banana 
1,7
2,10
4,5,20
5,1,5,7
7,5
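
On pandas 1.3 and later, the same rows can be skipped with the on_bad_lines parameter, which replaces error_bad_lines and warn_bad_lines. A minimal sketch, assuming a recent pandas version and using an inlined copy of the a.csv rows:

```python
import io
import pandas as pd

# Same rows as a.csv, inlined so the sketch is self-contained
data = "Apple,Banana\n1,7\n2,10\n4,5,20\n5,1,5,7\n7,5\n"

# on_bad_lines='skip' drops rows that have too many fields;
# on_bad_lines='warn' would skip them and also emit a warning
df = pd.read_csv(io.StringIO(data), on_bad_lines='skip')
print(df)
```

Only the three well-formed rows remain, as in the error_bad_lines=False output above.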
jezrael