6

How can I do a left outer join, excluding the intersection, in Pandas?

I have 2 pandas dataframes

df1 = pd.DataFrame(data = {'col1' : ['finance', 'finance', 'finance', 'accounting', 'IT'], 'col2' : ['az', 'bh', '', '', '']}) 
df2 = pd.DataFrame(data = {'col1' : ['finance', 'finance', 'finance', 'finance', 'finance'], 'col2' : ['', 'az', '', '', '']})

df1

    col1    col2
0   finance az
1   finance bh
2   finance 
3   accounting  
4   IT  

df2

    col1    col2
0   finance 
1   finance az
2   finance 
3   finance 
4   finance 

As you can see the dataframe has blank values as well. I tried using the example and its not giving me the result I want.

common = df1.merge(df2,on=['col1','col2'])
df3=df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]

I want the output to look something like

    col1    col2
3   accounting  
4   IT  
Josiah Yoder
  • 3,321
  • 4
  • 40
  • 58
StatguyUser
  • 2,595
  • 2
  • 22
  • 45

3 Answers3

13

Pandas left outer join exclusion can be achieved by setting pandas merge's indicator=True. Then filter by the indicator in _merge column.

df=pd.merge(df1,df2[['col1']],on=['col1'],how="outer",indicator=True)
df=df[df['_merge']=='left_only']
# this following line is just formating
df = df.reset_index()[['col1', 'col2']] 

Output:

col1    col2
0   accounting  
1   IT  

==================================

====The following is an example showing the mechanism====

df1 = pd.DataFrame({'key1': ['0', '1'],
                     'key2': [-1, -1],
                     'A': ['A0', 'A1'],
                     })


df2 = pd.DataFrame({'key1': ['0', '1'],
                      'key2': [1, -1], 
                    'B': ['B0', 'B1']
                     })

:

df1

Output:

    A   key1    key2
0   A0  0       -1
1   A1  1       -1

:

df2

Output:

    B   key1    key2
0   B0  0       1
1   B1  1       -1

:

df=pd.merge(df1,df2,on=['key1','key2'],how="outer",indicator=True)

:

Output:

     A  key1    key2    B   _merge
0   A0  0   -1  NaN left_only
1   A1  1   -1  B1  both
2   NaN 0   1   B0  right_only

:With the above indicators in the _merge column. you can select rows in one dataframe but not in another.

df=df[df['_merge']=='left_only']
df

Output:

    A   key1    key2    B   _merge
0   A0  0   -1  NaN left_only
Bin
  • 3,645
  • 10
  • 33
  • 57
4

A one liner for this based on Bin's answer may be:

df=pd.merge(df1,df2[['col1']],on=['col1'],how="outer",indicator=True).query('_merge=="left_only"')
Aidan Wood
  • 53
  • 6
0

This fails because you're independently checking for a match in col1 & col2, and excluding a match on either. The empty strings match the empty strings in the finance rows.

You'd want:

df3 = df1[(~df1.col1.isin(common.col1))|(~df1.col2.isin(common.col2))]
df3
Out[150]: 
         col1 col2
1     finance   bh
3  accounting     
4          IT  

To get the rows in df1 not in df2 .

To get specifically

df3
    col1    col2
3   accounting  
4   IT  

you might try just selecting those with a non-matching col1.

df3 = df1[~df1.col1.isin(df2.col1)]
df3
Out[172]: 
         col1 col2
3  accounting     
4          IT

To independently check for a match in col1 & col2 and exclude a match on either while having NaNs compare unequal/never count as a match, you could use

df3 = df1[(~df1.col1.isin(common.col1)|df1.col1.isnull())&(~df1.col2.isin(common.col2)|df1.col2.isnull())]
df3
Out[439]: 
         col1 col2
3  accounting  NaN
4          IT  NaN

assuming you're working with actual NaNs, either None or np.nan, in your actual data, instead of empty strings as in this example. If the latter, you'll need to add

df1.replace('', np.nan, inplace=True)
df2.replace('', np.nan, inplace=True)

first.

EFT
  • 2,359
  • 1
  • 10
  • 11
  • doesn't solve the purpose and not helpful for the original question – StatguyUser Jun 20 '17 at 18:40
  • @Enthusiast Edited, thoughts? – EFT Jun 20 '17 at 18:50
  • It only takes one column col1, whereas i want to consider all columns – StatguyUser Jun 21 '17 at 14:46
  • @Enthusiast Could you clarify, either here or in your main post, what you want to consider about all columns? If you want all rows that don't match in full, then, for this two-column case, you want the first option I posted, with the `1 finance bh` row. If you want all rows that don't share any value with any row in the other df, you want the empty df the method in your initial post gave you. Do you want to do one of those while having blank values always compare unequal? I'll edit momentarily, on that assumption, but if not that's really my last guess at your goal without clarification. – EFT Jun 21 '17 at 15:06