Merge Pandas Dataframe on a column with structured data

Question

Scenario: Following up from a previous question on how to read an excel file from a serve into a dataframe (How to read an excel file directly from a Server with Python), I am trying to merge the contexts of multiple dataframes (which contain data from excel worksheets).

Issue: Even after searching for similar issues here in SO, I still was not able to solve the problem.

Format of data (each sheet is read into a dataframe):

Sheet 1 (db1)
Name       CUSIP       Date       Price

 A          XXX     01/01/2001     100
 B          AAA     02/05/2005      90
 C          ZZZ     03/07/2006      95

Sheet2 (db2)
Ident      CUSIP       Value      Class

 123        XXX          0.5        AA
 444        AAA          1.3        AB
 555        ZZZ          2,8        AC

Wanted output (fnl):

Name       CUSIP       Date       Price       Ident       Value      Class
 A          XXX     01/01/2001     100         123          0.5        AA
 B          AAA     02/05/2005      90         444          1.3        AB
 C          ZZZ     03/07/2006      95         555          2.8        AC

What I already tried: I am trying to use the merge function to match each dataframe, but I am getting the error on the "how" part.

fnl = db1

fnl = fnl.merge(db2, how='outer', on=['CUSIP'])

fnl = fnl.merge(db3, how='outer', on=['CUSIP'])

fnl = fnl.merge(bte, how='outer', on=['CUSIP'])

I also tried the concatenate, but I just get a list of dataframes, instead of a single output.

wsframes = [db1 ,db2, db3]

fnl = pd.concat(wsframes, axis=1)

Question: What is the proper way to do this operation?

KeyError: 'CUSIP' for the merge procedure. For the concat I get the list of dataframes, but I cannot put them together to do operations. — DGMS89, Aug 18 '17 at 13:32
Your error suggests that 'CUSIP' is not visible as a column. What do you get when writing `db1['CUSIP']` or `db2['CUSIP']` ? — omu_negru, Aug 18 '17 at 13:34
Same error in fact. Can that be due to the data being separated from the headers by one empty row? — DGMS89, Aug 18 '17 at 13:36
They appear as Unnamed, I guess the problem is the empty row. — DGMS89, Aug 18 '17 at 13:42
Yes. The problem was the extra row. I removed it and now it works. thanks for the help. — DGMS89, Aug 18 '17 at 13:45

jezrael · Accepted Answer · 2017-08-18T13:40:51.343

6

It seems you need:

from functools import reduce
#many dataframes
dfs = [df1,df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
  Name CUSIP        Date  Price  Ident Value Class
0    A   XXX  01/01/2001    100    123   0.5    AA
1    B   AAA  02/05/2005     90    444   1.3    AB
2    C   ZZZ  03/07/2006     95    555   2,8    AC

But columns in each dataframe has to be different (no matched columns (CUSIP here)), else get _x and _y suffixes:

dfs = [df1,df1, df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
  Name_x CUSIP      Date_x  Price_x Name_y      Date_y  Price_y  Ident Value  \
0      A   XXX  01/01/2001      100      A  01/01/2001      100    123   0.5   
1      B   AAA  02/05/2005       90      B  02/05/2005       90    444   1.3   
2      C   ZZZ  03/07/2006       95      C  03/07/2006       95    555   2,8   

  Class  
0    AA  
1    AB  
2    AC

edited Aug 18 '17 at 13:40

answered Aug 18 '17 at 13:36

jezrael

822,522
95
1,334
1,252

I think he already tried merge, but was complaining of getting an error – omu_negru Aug 18 '17 at 13:38
I don't think ti's necessary to use another library like functools. He could also set_index on CUSIP and do the merge. – hyamanieu Aug 18 '17 at 13:39
@Wli - yes, but if need merge many dataframes, then this is simplier. – jezrael Aug 18 '17 at 13:39
1

Nice answer ~~! – BENY Aug 18 '17 at 14:15

Merge Pandas Dataframe on a column with structured data

1 Answers1