10

given two large dataframes, is there any concise and efficient code (avoid using any for loop directly) that allow me to obtain the complement of these two dataframes?

the most straight forward way to me is to compute union-intersection as shown in the naive example below, but I do not know how to implement this in an elegant languages of pandas or np

df1= pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})     
df2= pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})        
intersection= pd.merge(df1, df2, how='inner',on=['key1', 'key2'])
union=pd.merge(df1, df2, how='outer',on=['key1', 'key2'])       


complement=union-intersection

thanks for any comments and answers

user6651227
  • 137
  • 2
  • 7
  • Possible duplicate of [picking out elements based on complement of indices in Python pandas](http://stackoverflow.com/questions/14986510/picking-out-elements-based-on-complement-of-indices-in-python-pandas) – Mathias711 Aug 11 '16 at 21:38
  • @Mathias711, thanks. To apply their solution, I need to find the index of the `intersection` in the `union`. then I can apply their method to exclude previously obtained index – user6651227 Aug 11 '16 at 22:03

1 Answers1

24

Starting with this:

df1= pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})     
df2= pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})        
intersection  = pd.merge(df1, df2, how='inner',on=['key1', 'key2'])
union         = pd.merge(df1, df2, how='outer',on=['key1', 'key2'])       

print union

     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3

print intersection

    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
1  A2  B2   K1   K0  C1  D1
2  A2  B2   K1   K0  C2  D2

union-intersection try this:

union[union.isnull().any(axis=1)]

     A    B key1 key2    C    D
1   A1   B1   K0   K1  NaN  NaN
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3
Merlin
  • 24,552
  • 41
  • 131
  • 206