Pandas Merge DataFrames without rows overlap

Question

I have two dataframes like these:

They have the same columns.

Since I am broadcasting an API, they usually have some overlap, which can be handled by the tradeID, which is unique.

I have tried some stuff like:

df2 = df0.join(df1, how='outer', lsuffix='_caller', rsuffix='_other')

and

df2 = df0.merge(df1, left_index=True, right_index=True)

But the results are respectively:

and

I am looking for a union without overlap. Could someone help me?

So when a `tradeID` is present in both data frames, what do you expect to appear in the merged result? — Igor Raush, Jun 01 '17 at 23:26
@IgorRaush, both rows would be exactly the same, I would like to keep just one of them, please also note that `tradeID` is an index — Thiago Melo, Jun 01 '17 at 23:29
the code: `df2 = df0.merge(df1, how='outer')` works but it throws my indexes away — Thiago Melo, Jun 01 '17 at 23:36

elPastor · Accepted Answer · 2017-06-02T00:55:38.577

Seems like combine_first() should do it for you:

df2 = df0.combine_first(df1)

...where df0 takes precedence over df1 when the indices match. In your case the overlapping rows are identical, so the precedence doesn't really matter. When they are not identical, combine_first() keeps df0's non-null values and only fills df0's missing values from df1.
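To make the precedence rule concrete, here is a minimal sketch with made-up frames (the names `left` and `right` and all values are assumptions for illustration, not the asker's data). X002 appears in both frames with different amounts, and `left` has a missing amount for X003.

```python
import pandas as pd

# Hypothetical data: X002 overlaps with conflicting amounts,
# and left has a missing (NaN) amount for X003.
left = pd.DataFrame(
    {"amount": [100.0, 200.0, None]},
    index=pd.Index(["X001", "X002", "X003"], name="tradeID"),
)
right = pd.DataFrame(
    {"amount": [999.0, 300.0, 400.0]},
    index=pd.Index(["X002", "X003", "X004"], name="tradeID"),
)

# The calling frame wins wherever it has a non-null value;
# the other frame only fills gaps and contributes new index labels.
combined = left.combine_first(right)
print(combined)
```

Here X002 keeps 200.0 from `left`, X003 is filled with 300.0 from `right`, and X004 is added from `right`.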

The following is an example of it working with dummy data.

Code:

import pandas as pd
import io

a = io.StringIO(u'''
tradeID,amount,date
X001,100,1/1/2016
X002,200,1/2/2016
X003,300,1/3/2016
X005,500,1/5/2016
''')

b = io.StringIO(u'''
tradeID,amount,date
X004,400,1/4/2016
X005,500,1/5/2016
X006,600,1/6/2016
''')

dfA = pd.read_csv(a, index_col = 'tradeID')
dfB = pd.read_csv(b, index_col = 'tradeID')

df = dfA.combine_first(dfB)

Output:

         amount      date
tradeID                  
X001      100.0  1/1/2016
X002      200.0  1/2/2016
X003      300.0  1/3/2016
X004      400.0  1/4/2016
X005      500.0  1/5/2016
X006      600.0  1/6/2016

If you really want to use merge you can still do that, but you'll need to add some syntax to keep your indices:

df = dfA.reset_index().merge(dfB.reset_index(), how = 'outer').set_index('tradeID')
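As a rough illustration of why the reset_index()/set_index() round trip is needed, the sketch below compares a plain outer merge with the version above. The two small frames are assumed stand-ins for dfA and dfB, not the asker's real data.

```python
import pandas as pd

# Assumed stand-ins for dfA and dfB, indexed by tradeID.
dfA = pd.DataFrame(
    {"amount": [100, 500]},
    index=pd.Index(["X001", "X005"], name="tradeID"),
)
dfB = pd.DataFrame(
    {"amount": [500, 600]},
    index=pd.Index(["X005", "X006"], name="tradeID"),
)

# A plain outer merge joins on the shared columns only, so the
# tradeID index is dropped and replaced by a default integer index.
lost = dfA.merge(dfB, how="outer")
print(lost.index.tolist())

# Moving tradeID into a column first lets it take part in the merge;
# set_index() then restores it as the index.
kept = dfA.reset_index().merge(dfB.reset_index(), how="outer").set_index("tradeID")
print(kept)
```

One caveat with this approach: the merge matches on every shared column, so two rows with the same tradeID but different values in another column would both survive and leave a duplicated index label.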

I ran super rudimentary timing on these two options and combine_first() consistently beat merge by nearly 3x on this very small data set.

...and Igor Raush's version tested at or slightly faster than combine_first().
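For anyone who wants to repeat the comparison, here is one possible way to time the three options with the standard timeit module. It is only a sketch: the frames mirror the dummy data above, the call count of 50 is an arbitrary assumption, and the numbers will vary with machine, pandas version, and data size.

```python
import timeit

import pandas as pd

# Frames mirroring the dummy data used earlier in this answer.
dfA = pd.DataFrame(
    {"amount": [100, 200, 300, 500],
     "date": ["1/1/2016", "1/2/2016", "1/3/2016", "1/5/2016"]},
    index=pd.Index(["X001", "X002", "X003", "X005"], name="tradeID"),
)
dfB = pd.DataFrame(
    {"amount": [400, 500, 600],
     "date": ["1/4/2016", "1/5/2016", "1/6/2016"]},
    index=pd.Index(["X004", "X005", "X006"], name="tradeID"),
)

candidates = {
    "combine_first": lambda: dfA.combine_first(dfB),
    "merge": lambda: dfA.reset_index()
                        .merge(dfB.reset_index(), how="outer")
                        .set_index("tradeID"),
    "concat": lambda: pd.concat([dfA, dfB]).loc[lambda df: ~df.index.duplicated()],
}

CALLS = 50  # arbitrary; raise it for steadier numbers

# Take the best of three runs to reduce noise from other processes.
timings = {
    name: min(timeit.repeat(fn, number=CALLS, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds / CALLS * 1e6:.1f} us per call")
```

Timings on a six-row example say little about behaviour on large frames, so it is worth rerunning this on data of realistic size before choosing on speed alone.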

awesome! it worked exactly like i needed! thank you very much! — Thiago Melo, Jun 02 '17 at 13:10

score 1 · Answer 2 · answered Jun 01 '17 at 23:47

One way to accomplish this is

pd.concat([df0, df1]).loc[lambda df: ~df.index.duplicated()]
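Spelled out step by step, the one-liner stacks both frames and then drops every row whose index label has already been seen. The frames below are small assumed examples, and the final sort_index() call is an optional extra that is not part of the one-liner.

```python
import pandas as pd

# Assumed example frames indexed by tradeID; X005 appears in both.
df0 = pd.DataFrame(
    {"amount": [100, 500]},
    index=pd.Index(["X001", "X005"], name="tradeID"),
)
df1 = pd.DataFrame(
    {"amount": [400, 500]},
    index=pd.Index(["X004", "X005"], name="tradeID"),
)

# Step 1: stack the rows; the repeated X005 label is still present.
stacked = pd.concat([df0, df1])

# Step 2: flag repeats. With the default keep='first', the first
# occurrence of each label is False and later ones are True.
is_repeat = stacked.index.duplicated()

# Step 3: keep only the rows that are not repeats.
deduped = stacked.loc[~is_repeat]

# Optional: concat keeps df0's rows first, so sort if order matters.
deduped = deduped.sort_index()
print(deduped)
```

Because the first occurrence is kept, rows from df0 win over rows from df1 for a shared tradeID, the same precedence as df0.combine_first(df1) when df0 has no missing values. No NaN is introduced here, so integer columns stay integers.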

Igor Raush, answered Jun 01 '17 at 23:47
