2

I have 2 dataframes with strings in the cells:

df1

ID  t1  t2  t3
0   x1  y1  z1
1   x2  y2  z2 
2   x3  y3  z3 
3   x4  y4  z4  
4   x1  y5  z5 

df2

ID  t1  t2  t3
0   x3  y3  z3
1   x4  y4  z4 
2   x1  y1  z1 
3   x2  y2  z2  
4   x1  y7  z5 

I found that I can compare the differences in rows with:

#exactly the same t1, t2, and t3
pd.merge(df1, df2, on=['t1', 't2', 't3'], how='inner')

This will find an exact match between the rows (where t1 in df1 equals t1 in df2, etc.).

How can I find a semi match between the 2 dataframes for a specific column? That is, where there could be a difference in only the specified column in addition to the exact matches? For example, if I specify t2, a match will be t1 in df1 = t1 in df2, t2 in df1 != df2, t3 in df1 = t3 in df3 (for example, row ID=4 in the 2 dataframes will match this in addition to the exact matches).

Update 1:

It seems like a lot of answers take order into consideration (if the rows are not exactly align the method will fail).

Try the following to check your method:

d1 = {'Entity1': ['x1', 'x2','x3','x4','x1', 'x6', 'x1'], 'Relationship': ['y1', 'y2','y3','y4','y5','y6', 'y9'], 'Entity2': ['z1', 'z2','z3','z4','z5','z6', 'z5']}
df1 = pd.DataFrame(data=d1)


d2 = {'Entity1': ['x3', 'x4','x1','x2','x6','x1'], 'Relationship': ['y3', 'y4','y1','y2','y6','y7'], 'Entity2': ['z3', 'z4','z1','z2','z7','z5']}
df2 = pd.DataFrame(data=d2)

Note that one of the exact matches is x2, y2, z2, and one of the semi-match is df1 = x1, y5, z5, df2 = x1, y7,z5

Penguin
  • 1,923
  • 3
  • 21
  • 51
  • Are the following what you are looking for? `df3 = pd.merge(df1, df2, on=['t1', 't2', 't3'], how='outer');df3[df3.isna().any(axis=1)]` The result is a row by row extraction of the mismatched rows. – r-beginners Sep 07 '21 at 04:52
  • Do you just want to merge on `t1` and `t3`? `pd.merge(df1, df2, on=['t1', 't3'], how='inner')` – Kyle Parsons Sep 07 '21 at 19:53

4 Answers4

4

You could merge the two dataframes and then filter for all rows where t1 and t2 are the same on both sides:

df3 = pd.merge(df1, df2, left_index=True, right_index=True)
df3[(df3["t1_x"] == df3["t1_y"]) & (df3["t3_x"] == df3["t3_y"])]
Arne Decker
  • 808
  • 1
  • 3
  • 9
  • That's really good! This seems to give me the number of only 1 mismatched (like `ID=4` row). How can I get both these mismatched and full matches (from the question "...in addition to the exact matches")? Do I need to merge both data frames? That is, merge the result from `pd.merge(df1, df2, on=['t1', 't2', 't3'], how='inner')`? Will I have redundancies? – Penguin Sep 07 '21 at 13:41
  • Are you looking for matches and semi-matches or are you looking for mismatches? – Arne Decker Sep 08 '21 at 11:43
  • Not sure what you define as semi-matches vs mismatches. I'm looking for 1 mismatch in 1 column where the other columns are matches, in addition to full matches. Also, this seems to take order of rows into account (it fails if you switch the rows) – Penguin Sep 12 '21 at 17:56
1

The code I wrote in the comments is a way to get only the discrepancies.

df3 = pd.merge(df1, df2, on=['t1', 't2', 't3'], how='outer')
df3[df3.isna().any(axis=1)]

ID_x    t1  t2  t3  ID_y
4   4.0     x1  y5  z5  NaN
5   NaN     x1  y7  z5  4.0

All mismatches and exact matches can be retrieved by outer join if required.

df3 = pd.merge(df1, df2, on=['t1', 't2', 't3'], how='outer')

    ID_x    t1  t2  t3  ID_y
0   0.0     x1  y1  z1  2.0
1   1.0     x2  y2  z2  3.0
2   2.0     x3  y3  z3  0.0
3   3.0     x4  y4  z4  1.0
4   4.0     x1  y5  z5  NaN
5   NaN     x1  y7  z5  4.0
r-beginners
  • 31,170
  • 3
  • 14
  • 32
1

I wrote a generic and heavy duty function that can do what you are trying to achieve and a lot more (e.g. ignore certain columns)

>>> dict1 = diff_func(df1, df2, uid=['t1','t3'], labels=('df1','df2'))
DataType check: Passed
Uniqueness check: ['t1', 't3']
df1: 5 out of 5 are unique (100.0%).
df2: 5 out of 5 are unique (100.0%).

To get the dataframe where t1 and t3 are the same but t2 is different:

>>> dict1['Diff']                 
  df1 or df2  t1  t2  t3
0        df1  x1  y5  z5
1        df2  x1  y7  z5

The column named df1 or df2 tells you where the row came from and you can visually inspect each row by scanning top and bottom--rather than left and right--to identify the difference. (Imagine having 50 columns, a side-by-side view would not be as readable.)

To get the dataframe where all the columns have the same value (ignoring the index):

>>> dict1['Merge']
   t1  t2  t3
0  x1  y1  z1
1  x2  y2  z2
2  x3  y3  z3
3  x4  y4  z4

The advantage over one of the answers is that it doesn't matter if a specific row of data have different indexes. Note that your x4, y4, z4 sit in ID=3 and ID=1 respectively.

Do refer to my article here for a more detailed explanation.

Update 1

I've modified the code to cater to situations where there are multiple rows where the columns to be compared on have the same value.

>>> dict1['Diff']
  df1 or df2  Entity1  Relationship  Entity2
0        df1       x1            y5       z5
1        df1       x1            y9       z5
2        df2       x1            y7       z5

Based on your new example, there are records which do not have similar values in both tables for the columns to be compared on:

>>> dict1['Left_only']
  Entity1 Relationship Entity2
1      x6           y6      z6
>>> dict1['Right_only']
  Entity1 Relationship Entity2
3      x6           y6      z7

Since Entity2 differs for these 2 records in the df1 and df2 (also known as left and right dataframe respectively, they are not in the Diff dataframe.

Ji Wei
  • 840
  • 9
  • 19
  • Thanks! This seems to almost work. If there are 2 rows in `df1` that are also a match (`x1`, `y9`, `z5`) this stops working though – Penguin Sep 12 '21 at 17:54
  • this should work now, I've updated the code in the initial page to cater to the situation that you mentioned. – Ji Wei Sep 13 '21 at 09:31
  • @Penguin: Did you managed to check with my updated code? – Ji Wei Sep 14 '21 at 14:57
1

Or even better with eq:

>>> df3 = pd.merge(df1, df2, left_index=True, right_index=True)
>>> df3.loc[df1.eq(df2)[['t1', 't3']].all(1)]
   ID_x t1_x t2_x t3_x  ID_y t1_y t2_y t3_y
4     4   x1   y5   z5     4   x1   y7   z5
>>> 

This merges the dataframes and keeps the rows where df1 is the same as df2 on both of the columns t1 and t3.

Edit for the update:

If you also want to match rows in different indexes, try isin:

print(df3[np.isin(df1, df2).any(1) & (~np.isin(df1, df2).all(1))].dropna())

Output:

  Entity1_x Relationship_x Entity2_x Entity1_y Relationship_y Entity2_y
4        x1             y5        z5        x6             y6        z7
5        x6             y6        z6        x1             y7        z5
U13-Forward
  • 69,221
  • 14
  • 89
  • 114