6

I have been trying to implement left outer join in python.I see that there is slight difference between left join and left outer join.

As in this link : LEFT JOIN vs. LEFT OUTER JOIN in SQL Server

I could get my hands on below with sample examples:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
'value1': np.random.randn(4)})

df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
 'value2': np.random.randn(4)})

df3 = df1.merge(df2, on=['key'], how='left')

This gives records from df1 in total (including the intersected ones)

But how do I do the left outer join which has only records from df1 which are not in df2?

Not: This is example only.I might have large number of columns (different) in either dataframes.

Please help.

Community
  • 1
  • 1
marupav
  • 345
  • 1
  • 3
  • 15

1 Answers1

9

set param indicator=True, this will add a column _merge you then filter just the rows that are left_only:

In [46]:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
'value1': np.random.randn(4)})
​
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
 'value2': np.random.randn(4)})
​
df3 = df1.merge(df2, on=['key'], how='left', indicator=True)
df3

Out[46]:
  key    value1    value2     _merge
0   A -0.346861       NaN  left_only
1   B  1.120739  0.558272       both
2   C  0.023881       NaN  left_only
3   D -0.598771 -0.823035       both
4   D -0.598771  0.369423       both

In [48]:
df3[df3['_merge'] == 'left_only']

Out[48]:
  key    value1  value2     _merge
0   A -0.346861     NaN  left_only
2   C  0.023881     NaN  left_only

if on older version then use isin with ~ to negate the mask:

In [50]:
df3[~df3['key'].isin(df2['key'])]

Out[50]:
  key    value1  value2
0   A -0.346861     NaN
2   C  0.023881     NaN
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • I was using an older version of pandas and couldn't think of this point.Thanks.Any other alternatives assuming still I am on older version? – marupav Jul 04 '16 at 12:23
  • you can use `isin` with `~` to negate the mask to filter the rows that are in `df2` – EdChum Jul 04 '16 at 12:25
  • Thanks EdChum.That Works!! What if I have more one key,say key1,key2,key3.df3[~df3[['key1','key2','key3']].isin(df2[[''key1','key2','key3']])] seems to be missing some information. – marupav Jul 05 '16 at 05:19
  • That's more complicated as you start to introduce permutations unless you can compare row-wise and that is very difficult here. One thing which may work is to set the index to all three key columns for both dfs and then do `df1.index.difference(df2.index)` this will return the rows that are in df1 but not in df2 – EdChum Jul 05 '16 at 07:50
  • Will try that and let you know.Thanks. – marupav Jul 05 '16 at 08:13