Concatenate dataframes where column names in dataframe differ

Question

I have multiple dataframes whose columns differ, and I want to build result_df (shown below) by concatenating them. Any suggestions on this?

I am using the following code, but I am getting this error: Error tokenizing data. C error: Expected 1 fields in line 4, saw 5

list_ = []

if os.path.exists(csvfile):
    df = pd.read_csv(csvfile, sep=',', encoding='utf-8')
    list_.append(pd.concat(df))

frame = pd.concat(list_, ignore_index=True)



df1 = 
  Apple Banana 
0 1     7
1 2     10
2 4     5
3 5     1
4 7     5

df2 =

  Apple Banana Carrot
0 1     7       5
1 2     10      8
2 4     5       8 
3 5     1       2
4 7     5       1


df3 =

  Apple Carrot Mango
0 1       5      2
1 2       8      3
2 4       8      7
3 5       2      1
4 7       1      5


result_df = 

   Apple Banana Carrot Mango
0  1     7      n.a    n.a
1  2     10     n.a    n.a
2  4     5      n.a    n.a
3  5     1      n.a    n.a
4  7     5      n.a    n.a
5  1     7      5      n.a
6  2     10     8      n.a
7  4     5      8      n.a
8  5     1      2      n.a
9  7     5      1      n.a
10 1     n.a    5      2
11 2     n.a    8      3
12 4     n.a    8      7
13 5     n.a    2      1
14 7     n.a    1      5

jezrael · Accepted Answer · 2018-04-04T10:13:53.597


I believe you need error_bad_lines=False, and the first concat is not necessary:

list_ = []

if os.path.exists(csvfile):
    list_.append(pd.read_csv(csvfile, sep=',', encoding='utf-8', error_bad_lines=False))

frame = pd.concat(list_,ignore_index=True)

Another solution:

import glob

files = glob.glob('files/*.csv')
list_ = [pd.read_csv(fp, encoding='utf-8', error_bad_lines=False) for fp in files]
df = pd.concat(list_, ignore_index=True)
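On the differing column names themselves: pd.concat aligns the frames by column name and fills the cells a frame does not have with NaN, so no extra handling should be needed once the files load. A minimal sketch, assuming shortened two-row versions of df1, df2 and df3 from the question (the inline data is illustrative, not read from the asker's files):

```python
import pandas as pd

# Shortened stand-ins for df1, df2, df3 from the question (assumed data).
df1 = pd.DataFrame({"Apple": [1, 2], "Banana": [7, 10]})
df2 = pd.DataFrame({"Apple": [1, 2], "Banana": [7, 10], "Carrot": [5, 8]})
df3 = pd.DataFrame({"Apple": [1, 2], "Carrot": [5, 8], "Mango": [2, 3]})

# Columns are matched by name; the result has the union of all columns,
# and values missing from a given frame become NaN.
# sort=False keeps the columns in order of first appearance.
result_df = pd.concat([df1, df2, df3], ignore_index=True, sort=False)
print(result_df)
```

The NaN cells correspond to the n.a entries in the desired result_df; columns that receive NaN are upcast to float.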

EDIT: To find the problematic rows, you can compare the length of each row with the length of the header:

import csv

with open('a.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    len_header = len(next(reader))
    for row in reader:
        if (len(row) != len_header):
            print ("Length of row is: %r" % len(row) )
            print (row)

Length of row is: 3
['4', '5', '20']
Length of row is: 4
['5', '1', '5', '7']

df = pd.read_csv('a.csv', warn_bad_lines=True, error_bad_lines=False)
print (df)
   Apple  Banana 
0      1        7
1      2       10
2      7        5
b'Skipping line 4: expected 2 fields, saw 3\nSkipping line 5: expected 2 fields, saw 4\n'

a.csv:

Apple,Banana 
1,7
2,10
4,5,20
5,1,5,7
7,5
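Note for newer pandas versions: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 in favor of on_bad_lines and removed in pandas 2.0. From pandas 1.4, on_bad_lines also accepts a callable when engine='python' is used, which makes it possible to collect the skipped rows while loading. A minimal sketch under those assumptions; the names bad_rows and collect_bad_line are hypothetical, and the CSV text is inlined (with a clean header) instead of being read from a.csv:

```python
import io

import pandas as pd

# Same rows as a.csv above, inlined so the sketch is self-contained.
csv_text = "Apple,Banana\n1,7\n2,10\n4,5,20\n5,1,5,7\n7,5\n"

bad_rows = []  # hypothetical collector for rows with too many fields


def collect_bad_line(fields):
    """Record the offending row (a list of strings) and skip it."""
    bad_rows.append(fields)
    return None  # returning None tells pandas to drop the row


# A callable for on_bad_lines requires pandas >= 1.4 and engine='python'.
df = pd.read_csv(
    io.StringIO(csv_text),
    engine="python",
    on_bad_lines=collect_bad_line,
)

print(df)
print(bad_rows)
```

The callable is invoked for rows that have more fields than the header, which is the case that raises the tokenizing error. The Python engine is slower than the default C engine, so for a file with around a million lines it may be worth running the csv check above first.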

edited Apr 04 '18 at 10:13

answered Apr 04 '18 at 07:54


I am afraid that error_bad_lines may remove some data from my dataframe. Is there a way to check what those bad lines are? I am loading a file which has close to 1 million lines. – Sun Apr 04 '18 at 08:10
Not easy, but you can try [this](https://stackoverflow.com/q/32334966/2901002) solution. – jezrael Apr 04 '18 at 08:15
Thanks Jezrael. It works, but I have to figure out what is getting removed. – Sun Apr 04 '18 at 10:09
@Sun - I am working on a solution for this, give me a sec. – jezrael Apr 04 '18 at 10:10
Thank you Jezrael. Appreciate your help. – Sun Apr 06 '18 at 08:11
@Sun - You are welcome! Feel free to upvote my solution: click the small triangle above the `0` above the accept mark. Thanks. – jezrael Apr 06 '18 at 08:12
@Sun - Thank you. – jezrael Apr 11 '18 at 05:03
