0

Following this answer:

https://stackoverflow.com/a/47107164/11462274

I try to create a DataFrame that is only the lines not found in another DataFrame, however, not according to all columns, but according to only some specific columns, so I tried to do it this way:

import pandas as pd

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10],
                           'col3' : [1,5,7,9,6,7]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12],
                           'col3' : [1,5,8]})

df_merge = df1.merge(df2.drop_duplicates(), on=['col1','col3'], 
                   how='left', indicator=True)

df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
print(df_merge)

But note that when not using all the columns, they change their name like col2 to col2_x:

   col1  col2_x  col3  col2_y     _merge
0     1      10     1    10.0       both
1     2      11     5    11.0       both
2     3      12     7     NaN  left_only
3     4      13     9     NaN  left_only
4     5      14     6     NaN  left_only
5     3      10     7     NaN  left_only

So when I try to create the final DataFrame without the unnecessary columns, the unused columns are not found to generate the desired filter:

KeyError(f"{not_found} not in index")
KeyError: "['col2'] not in index"
Digital Farmer
  • 1,705
  • 5
  • 17
  • 67

2 Answers2

1

You can use the suffixes parameter of pandas.DataFrame.merge :

df_merge = df1.merge(df2.drop_duplicates(), on=['col1','col3'], 
                   how='left', indicator=True, suffixes=("", "_"))
​
df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]

​ Output :

print(df_merge)
   col1  col2  col3
2     3    12     7
3     4    13     9
4     5    14     6
5     3    10     7
Timeless
  • 22,580
  • 4
  • 12
  • 30
0

Another option is that considering that it's left join you can just drop the columns from other df that you know would overlap (thereby making a smaller merge result):

df_merge = df1.merge(df2.drop_duplicates().drop(columns=['col2']), 
                     on=['col1','col3'], how='left', indicator=True)

df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
print(df_merge)

  col1  col2  col3
2     3    12     7
3     4    13     9
4     5    14     6
5     3    10     7
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105