I have two dataframes, say df1 and df2, with the same column names.

Example:

df1

C1 | C2 | C3 | C4
A  | 1  | 2  | AA
B  | 1  | 3  | A
A  | 3  | 2  | B

df2

C1 | C2 | C3 | C4
A  | 1  | 3  | E
B  | 1  | 2  | C
Q  | 4  | 1  | Z

I would like to filter out rows of df1 whose values, in a fixed subset of columns, also appear in df2. In the example above, if the columns are C1 and C2, the first two rows of df1 should be filtered out, because their (C1, C2) values also occur in df2.

What would be a clean way to do this in Pandas?

So far, based on this answer, I have been able to find the common rows.

common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])

This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.

I have also seen this thread, but the answers all seem to assume a difference on all the columns.

The expected result for the above example (rows common on specified columns removed):

C1 | C2 | C3 | C4
A  | 3  | 2  | B
GoodDeeds

3 Answers

Maybe not the cleanest, but you could add a key column to df1 to check against.

Setting up the datasets

import pandas as pd

df1 = pd.DataFrame({'C1': ['A', 'B', 'A'],
                    'C2': [1, 1, 3],
                    'C3': [2, 3, 2],
                    'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'],
                    'C2': [1, 1, 4],
                    'C3': [3, 2, 1],
                    'C4': ['E', 'C', 'Z']})

Adding a key and using your code to find the common rows:

df1['key'] = range(1, len(df1) + 1)  # tag each df1 row with a unique key
common_df = pd.merge(df1, df2, how='inner', on=['C1', 'C2'])
# keep only the rows whose key is absent from the common set
df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)
Roelant

You can use an anti-join: do an outer join on the specified columns with indicator=True, which adds a _merge column recording which side each row came from. The only downside is that you have to rename and drop the extra columns after the join.

>>> import pandas as pd
>>> df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
>>> df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
>>> df_merged = df1.merge(df2, on=['C1','C2'], indicator=True, how='outer')
>>> df_merged
  C1  C2  C3_x C4_x  C3_y C4_y      _merge
0  A   1   2.0   AA   3.0    E        both
1  B   1   3.0    A   2.0    C        both
2  A   3   2.0    B   NaN  NaN   left_only
3  Q   4   NaN  NaN   1.0    Z  right_only
>>> df1_setdiff = df_merged[df_merged['_merge'] == 'left_only'].rename(columns={'C3_x': 'C3', 'C4_x': 'C4'}).drop(['C3_y', 'C4_y', '_merge'], axis=1)
>>> df1_setdiff
  C1  C2   C3 C4
2  A   3  2.0  B
>>> df2_setdiff = df_merged[df_merged['_merge'] == 'right_only'].rename(columns={'C3_y': 'C3', 'C4_y': 'C4'}).drop(['C3_x', 'C4_x', '_merge'], axis=1)
>>> df2_setdiff
  C1  C2   C3 C4
3  Q   4  1.0  Z
Scratch'N'Purr
import pandas as pd

df1 = pd.DataFrame({'C1': ['A', 'B', 'A'], 'C2': [1, 1, 3], 'C3': [2, 3, 2], 'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'], 'C2': [1, 1, 4], 'C3': [3, 2, 1], 'C4': ['E', 'C', 'Z']})
common = pd.merge(df1, df2, on=['C1', 'C2'])
# note: isin checks each column independently, so a row can also be filtered
# out when its C1 and C2 match *different* common rows rather than one pair
R1 = df1[~((df1.C1.isin(common.C1)) & (df1.C2.isin(common.C2)))]
R2 = df2[~((df2.C1.isin(common.C1)) & (df2.C2.isin(common.C2)))]

df1:

    C1  C2  C3  C4
0   A   1   2   AA
1   B   1   3   A
2   A   3   2   B

df2:

    C1  C2  C3  C4
0   A   1   3   E
1   B   1   2   C
2   Q   4   1   Z

common:

    C1  C2   C3_x C4_x  C3_y C4_y
0   A   1    2    AA    3    E
1   B   1    3    A     2    C

R1:

    C1  C2  C3  C4
2   A   3   2   B

R2:

    C1  C2  C3  C4
2   Q   4   1   Z
Pygirl
  • Thank you, +1. I find this answer useful. However, if the number of columns becomes large, the expression could become big. – GoodDeeds Jun 20 '18 at 08:43
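To keep the expression manageable when the subset of columns grows, one generalization (a sketch, not from the answer above) compares whole key tuples via a MultiIndex; this also matches (C1, C2) as a pair rather than each column independently:

```python
import pandas as pd

df1 = pd.DataFrame({'C1': ['A', 'B', 'A'], 'C2': [1, 1, 3],
                    'C3': [2, 3, 2], 'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'], 'C2': [1, 1, 4],
                    'C3': [3, 2, 1], 'C4': ['E', 'C', 'Z']})

cols = ['C1', 'C2']  # the fixed subset of key columns; extend as needed
# build a key-tuple index for each frame and test membership row by row
mask = df1.set_index(cols).index.isin(df2.set_index(cols).index)
R1 = df1[~mask]
print(R1)
```

Only the cols list changes as more key columns are added, so the filter expression stays the same size.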