Compare 2 DataFrames for semi matching rows

Question

I have 2 dataframes with strings in the cells:

df1

ID  t1  t2  t3
0   x1  y1  z1
1   x2  y2  z2 
2   x3  y3  z3 
3   x4  y4  z4  
4   x1  y5  z5 

df2

ID  t1  t2  t3
0   x3  y3  z3
1   x4  y4  z4 
2   x1  y1  z1 
3   x2  y2  z2  
4   x1  y7  z5

I found that I can compare the differences in rows with:

#exactly the same t1, t2, and t3
pd.merge(df1, df2, on=['t1', 't2', 't3'], how='inner')

This will find an exact match between the rows (where t1 in df1 equals t1 in df2, etc.).

How can I find a semi match between the 2 dataframes for a specific column? That is, where there could be a difference in only the specified column in addition to the exact matches? For example, if I specify t2, a match will be t1 in df1 = t1 in df2, t2 in df1 != df2, t3 in df1 = t3 in df3 (for example, row ID=4 in the 2 dataframes will match this in addition to the exact matches).

Update 1:

It seems like a lot of answers take order into consideration (if the rows are not exactly align the method will fail).

Try the following to check your method:

d1 = {'Entity1': ['x1', 'x2','x3','x4','x1', 'x6', 'x1'], 'Relationship': ['y1', 'y2','y3','y4','y5','y6', 'y9'], 'Entity2': ['z1', 'z2','z3','z4','z5','z6', 'z5']}
df1 = pd.DataFrame(data=d1)


d2 = {'Entity1': ['x3', 'x4','x1','x2','x6','x1'], 'Relationship': ['y3', 'y4','y1','y2','y6','y7'], 'Entity2': ['z3', 'z4','z1','z2','z7','z5']}
df2 = pd.DataFrame(data=d2)

Note that one of the exact matches is x2, y2, z2, and one of the semi-match is df1 = x1, y5, z5, df2 = x1, y7,z5

Are the following what you are looking for? `df3 = pd.merge(df1, df2, on=['t1', 't2', 't3'], how='outer');df3[df3.isna().any(axis=1)]` The result is a row by row extraction of the mismatched rows. — r-beginners, Sep 07 '21 at 04:52
Do you just want to merge on `t1` and `t3`? `pd.merge(df1, df2, on=['t1', 't3'], how='inner')` — Kyle Parsons, Sep 07 '21 at 19:53

score 4 · Answer 1 · answered Sep 07 '21 at 10:10

4

You could merge the two dataframes and then filter for all rows where t1 and t2 are the same on both sides:

df3 = pd.merge(df1, df2, left_index=True, right_index=True)
df3[(df3["t1_x"] == df3["t1_y"]) & (df3["t3_x"] == df3["t3_y"])]

answered Sep 07 '21 at 10:10

Arne Decker

808
1
3
9

That's really good! This seems to give me the number of only 1 mismatched (like `ID=4` row). How can I get both these mismatched and full matches (from the question "...in addition to the exact matches")? Do I need to merge both data frames? That is, merge the result from `pd.merge(df1, df2, on=['t1', 't2', 't3'], how='inner')`? Will I have redundancies? – Penguin Sep 07 '21 at 13:41
Are you looking for matches and semi-matches or are you looking for mismatches? – Arne Decker Sep 08 '21 at 11:43
Not sure what you define as semi-matches vs mismatches. I'm looking for 1 mismatch in 1 column where the other columns are matches, in addition to full matches. Also, this seems to take order of rows into account (it fails if you switch the rows) – Penguin Sep 12 '21 at 17:56

score 1 · Answer 2 · answered Sep 07 '21 at 14:50

1

The code I wrote in the comments is a way to get only the discrepancies.

df3 = pd.merge(df1, df2, on=['t1', 't2', 't3'], how='outer')
df3[df3.isna().any(axis=1)]

ID_x    t1  t2  t3  ID_y
4   4.0     x1  y5  z5  NaN
5   NaN     x1  y7  z5  4.0

All mismatches and exact matches can be retrieved by outer join if required.

df3 = pd.merge(df1, df2, on=['t1', 't2', 't3'], how='outer')

    ID_x    t1  t2  t3  ID_y
0   0.0     x1  y1  z1  2.0
1   1.0     x2  y2  z2  3.0
2   2.0     x3  y3  z3  0.0
3   3.0     x4  y4  z4  1.0
4   4.0     x1  y5  z5  NaN
5   NaN     x1  y7  z5  4.0

answered Sep 07 '21 at 14:50

r-beginners

31,170
3
14
32

Thanks! But this assumes that the order of the rows matter (if you switch row 4 and row 5 in `df2` this will fail) – Penguin Sep 12 '21 at 17:49
I changed the order of df2 and the result is the same, how can it fail? – r-beginners Sep 13 '21 at 03:09

Ji Wei · Answer 3 · 2021-09-13T13:38:20.283

I wrote a generic and heavy duty function that can do what you are trying to achieve and a lot more (e.g. ignore certain columns)

>>> dict1 = diff_func(df1, df2, uid=['t1','t3'], labels=('df1','df2'))
DataType check: Passed
Uniqueness check: ['t1', 't3']
df1: 5 out of 5 are unique (100.0%).
df2: 5 out of 5 are unique (100.0%).

To get the dataframe where t1 and t3 are the same but t2 is different:

>>> dict1['Diff']                 
  df1 or df2  t1  t2  t3
0        df1  x1  y5  z5
1        df2  x1  y7  z5

The column named df1 or df2 tells you where the row came from and you can visually inspect each row by scanning top and bottom--rather than left and right--to identify the difference. (Imagine having 50 columns, a side-by-side view would not be as readable.)

To get the dataframe where all the columns have the same value (ignoring the index):

>>> dict1['Merge']
   t1  t2  t3
0  x1  y1  z1
1  x2  y2  z2
2  x3  y3  z3
3  x4  y4  z4

The advantage over one of the answers is that it doesn't matter if a specific row of data have different indexes. Note that your x4, y4, z4 sit in ID=3 and ID=1 respectively.

Do refer to my article here for a more detailed explanation.

Update 1

I've modified the code to cater to situations where there are multiple rows where the columns to be compared on have the same value.

>>> dict1['Diff']
  df1 or df2  Entity1  Relationship  Entity2
0        df1       x1            y5       z5
1        df1       x1            y9       z5
2        df2       x1            y7       z5

Based on your new example, there are records which do not have similar values in both tables for the columns to be compared on:

>>> dict1['Left_only']
  Entity1 Relationship Entity2
1      x6           y6      z6
>>> dict1['Right_only']
  Entity1 Relationship Entity2
3      x6           y6      z7

Since Entity2 differs for these 2 records in the df1 and df2 (also known as left and right dataframe respectively, they are not in the Diff dataframe.

Thanks! This seems to almost work. If there are 2 rows in `df1` that are also a match (`x1`, `y9`, `z5`) this stops working though — Penguin, Sep 12 '21 at 17:54
this should work now, I've updated the code in the initial page to cater to the situation that you mentioned. — Ji Wei, Sep 13 '21 at 09:31

U13-Forward · Answer 4 · 2021-09-14T13:57:15.200

1

Or even better with eq:

>>> df3 = pd.merge(df1, df2, left_index=True, right_index=True)
>>> df3.loc[df1.eq(df2)[['t1', 't3']].all(1)]
   ID_x t1_x t2_x t3_x  ID_y t1_y t2_y t3_y
4     4   x1   y5   z5     4   x1   y7   z5
>>>

This merges the dataframes and keeps the rows where df1 is the same as df2 on both of the columns t1 and t3.

Edit for the update:

If you also want to match rows in different indexes, try isin:

print(df3[np.isin(df1, df2).any(1) & (~np.isin(df1, df2).all(1))].dropna())

Output:

  Entity1_x Relationship_x Entity2_x Entity1_y Relationship_y Entity2_y
4        x1             y5        z5        x6             y6        z7
5        x6             y6        z6        x1             y7        z5

edited Sep 14 '21 at 13:57

answered Sep 11 '21 at 07:33

U13-Forward

69,221
14
89
114

Thanks! But this assumes that the order of the rows matter (if you switch row 4 and row 5 in `df2` this will fail) – Penguin Sep 12 '21 at 17:49
@Penguin Yeah, all of the answers are like that, will edit one that works for all rows – U13-Forward Sep 13 '21 at 03:19
@Penguin Edited my answer with something that works for unordered rows, with `isin`, feel free to check it out. – U13-Forward Sep 13 '21 at 03:21
@Penguin What's the desired output for that one? – U13-Forward Sep 13 '21 at 03:39
@Penguin Edited my answer with something that works (if I understand update 1 correctly). – U13-Forward Sep 13 '21 at 03:42
It didn't quite work. The update you made finds the exact matches but not the semi matches. Thanks though! – Penguin Sep 14 '21 at 12:55
@Penguin before the bounty ends, could you show me the expected output for this example data? – U13-Forward Sep 14 '21 at 13:01
@Penguin Weight! I think I though of something, `df3[np.isin(df1, df2).all(1) & (~df1.eq(df2).any(1))]` – U13-Forward Sep 14 '21 at 13:03
Hmm this seems to give the same output as your edit – Penguin Sep 14 '21 at 13:06
@Penguin Check my new answer, this definietely should work – U13-Forward Sep 14 '21 at 13:07
This doesn't print anything... The dataframe is empty... – Penguin Sep 14 '21 at 13:10
@Penguin Could you please edit your question with a desired output? – U13-Forward Sep 14 '21 at 13:50
@Penguin I am so sorry, it was my typo, edited my answer – U13-Forward Sep 14 '21 at 13:57

Compare 2 DataFrames for semi matching rows

4 Answers4

Edit for the update: