2

Scenario: Following up from a previous question on how to read an excel file from a serve into a dataframe (How to read an excel file directly from a Server with Python), I am trying to merge the contexts of multiple dataframes (which contain data from excel worksheets).

Issue: Even after searching for similar issues here in SO, I still was not able to solve the problem.

Format of data (each sheet is read into a dataframe):

Sheet 1 (db1)
Name       CUSIP       Date       Price

 A          XXX     01/01/2001     100
 B          AAA     02/05/2005      90
 C          ZZZ     03/07/2006      95

Sheet2 (db2)
Ident      CUSIP       Value      Class

 123        XXX          0.5        AA
 444        AAA          1.3        AB
 555        ZZZ          2,8        AC

Wanted output (fnl):

Name       CUSIP       Date       Price       Ident       Value      Class
 A          XXX     01/01/2001     100         123          0.5        AA
 B          AAA     02/05/2005      90         444          1.3        AB
 C          ZZZ     03/07/2006      95         555          2.8        AC

What I already tried: I am trying to use the merge function to match each dataframe, but I am getting the error on the "how" part.

fnl = db1

fnl = fnl.merge(db2, how='outer', on=['CUSIP'])

fnl = fnl.merge(db3, how='outer', on=['CUSIP'])

fnl = fnl.merge(bte, how='outer', on=['CUSIP'])

I also tried the concatenate, but I just get a list of dataframes, instead of a single output.

wsframes = [db1 ,db2, db3]

fnl = pd.concat(wsframes, axis=1)

Question: What is the proper way to do this operation?

DGMS89
  • 1,507
  • 6
  • 29
  • 60

1 Answers1

6

It seems you need:

from functools import reduce
#many dataframes
dfs = [df1,df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
  Name CUSIP        Date  Price  Ident Value Class
0    A   XXX  01/01/2001    100    123   0.5    AA
1    B   AAA  02/05/2005     90    444   1.3    AB
2    C   ZZZ  03/07/2006     95    555   2,8    AC

But columns in each dataframe has to be different (no matched columns (CUSIP here)), else get _x and _y suffixes:

dfs = [df1,df1, df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
  Name_x CUSIP      Date_x  Price_x Name_y      Date_y  Price_y  Ident Value  \
0      A   XXX  01/01/2001      100      A  01/01/2001      100    123   0.5   
1      B   AAA  02/05/2005       90      B  02/05/2005       90    444   1.3   
2      C   ZZZ  03/07/2006       95      C  03/07/2006       95    555   2,8   

  Class  
0    AA  
1    AB  
2    AC  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252