How can I pick out the difference between to columns of the same name in two dataframes?
I mean I have dataframe A with a column named X and dataframe B with column named X, if i do pd.merge(A, B, on=['X'])
, i'll get the common X values of A and B, but how can i get the "non-common" ones?
Asked
Active
Viewed 3.2k times
43

Polly
- 1,057
- 5
- 14
- 23
2 Answers
83
If you change the merge type to how='outer'
and indicator=True
this will add a column to tell you whether the values are left/both/right only:
In [2]:
A = pd.DataFrame({'x':np.arange(5)})
B = pd.DataFrame({'x':np.arange(3,8)})
print(A)
print(B)
x
0 0
1 1
2 2
3 3
4 4
x
0 3
1 4
2 5
3 6
4 7
In [3]:
pd.merge(A,B, how='outer', indicator=True)
Out[3]:
x _merge
0 0.0 left_only
1 1.0 left_only
2 2.0 left_only
3 3.0 both
4 4.0 both
5 5.0 right_only
6 6.0 right_only
7 7.0 right_only
You can then filter the resultant merged df on the _merge
col:
In [4]:
merged = pd.merge(A,B, how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
Out[4]:
x _merge
0 0.0 left_only
1 1.0 left_only
2 2.0 left_only
You can also use isin
and negate the mask to find values not in B:
In [5]:
A[~A['x'].isin(B['x'])]
Out[5]:
x
0 0
1 1
2 2

EdChum
- 376,765
- 198
- 813
- 562
16
The accepted answer gives a so called LEFT JOIN IF NULL
in SQL terms. If you want all the rows except the matching ones from both DataFrames, not only left. You have to add another condition to the filter, since you want to exclude all rows which are in both
.
In this case we use DataFrame.merge
& DataFrame.query
:
df1 = pd.DataFrame({'A':list('abcde')})
df2 = pd.DataFrame({'A':list('cdefgh')})
print(df1, '\n')
print(df2)
A
0 a # <- only df1
1 b # <- only df1
2 c # <- both
3 d # <- both
4 e # <- both
A
0 c # both
1 d # both
2 e # both
3 f # <- only df2
4 g # <- only df2
5 h # <- only df2
df = (
df1.merge(df2,
on='A',
how='outer',
indicator=True)
.query('_merge != "both"')
.drop(columns='_merge')
)
print(df)
A
0 a
1 b
5 f
6 g
7 h

Erfan
- 40,971
- 8
- 66
- 78