pandas how to compare rows of 2 dataframes regardless of order

Question

import pandas as pd
df1 = pd.DataFrame(index=[1,2,3,4])


df1['A'] = [1,2,5,4]
df1['B'] = [5,6,9,8]
df1['C'] = [9,10,1,12]

>>> df1
   A  B   C
1  1  5   9
2  2  6  10
3  5  9   1
4  4  8  12

I want to compare the rows of df1 and get the result that row 1 (1, 5, 9) equals row 3 (5, 9, 1).

That is, I only care about which items a row contains, and I want to ignore the order of the items within the row.

@jezrael I didn't think about output, but the purpose is to find/remove duplicated rows. — sh kim, Jul 10 '18 at 07:46

jezrael · Accepted Answer · 2018-07-10T07:49:46.933

I think you need to sort the values within each row with np.sort:

import numpy as np

# sort the values inside each row (axis=1) so that item order no longer matters
df2 = pd.DataFrame(np.sort(df1.values, axis=1), index=df1.index, columns=df1.columns)
print (df2)
   A  B   C
1  1  5   9
2  2  6  10
3  1  5   9
4  4  8  12

Then remove the duplicates by filtering with the inverted (~) boolean mask returned by duplicated:

df2 = pd.DataFrame(np.sort(df1.values, axis=1), index=df1.index)
print (df2)
   0  1   2
1  1  5   9
2  2  6  10
3  1  5   9
4  4  8  12

df1 = df1[~df2.duplicated()]
print (df1)
   A  B   C
1  1  5   9
2  2  6  10
4  4  8  12
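
Since the question asks about comparing rows as well as removing them, the two steps above can be wrapped in one helper. This is only a minimal sketch: the function name unordered_duplicates is hypothetical, and it assumes all columns hold values that can be sorted together (for example, all numeric). Passing keep=False to duplicated flags every row in a matching group, which shows which rows are equal regardless of order.

```python
import numpy as np
import pandas as pd


def unordered_duplicates(df, keep="first"):
    """Return a boolean mask of rows whose values repeat another row's values in any order.

    Hypothetical helper for illustration; assumes the values in a row are mutually sortable.
    """
    # Sort within each row so rows with the same items become identical.
    sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)
    return sorted_rows.duplicated(keep=keep)


df1 = pd.DataFrame(
    {"A": [1, 2, 5, 4], "B": [5, 6, 9, 8], "C": [9, 10, 1, 12]},
    index=[1, 2, 3, 4],
)

mask = unordered_duplicates(df1)                       # True only for row 3
deduped = df1[~mask]                                   # keeps rows 1, 2 and 4
matches = df1[unordered_duplicates(df1, keep=False)]   # rows 1 and 3
```

Here deduped keeps the original column order of df1, because the sorted copy is used only to build the mask.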

edited Jul 10 '18 at 07:49

answered Jul 10 '18 at 07:13


OP is also asking about row comparisons, not just sorting it. – iDrwish Jul 10 '18 at 07:19
@iDrwish - Yes, this should be first step, I also ask OP in comment for expected output. – jezrael Jul 10 '18 at 07:25

score 0 · Answer 2 · answered Jul 10 '18 at 07:20

If no value appears twice within a row, you could simply convert each row into a set:

# .loc selects by index label; .iloc is zero-based, so .iloc[1] would be the second row
row1 = df1.loc[1]
row3 = df1.loc[3]
set(row1) == set(row3)

This has the advantage that you can then compare the rows further, e.g. to find out whether a value is in one row but not in the other:

set(row1) - set(row3)  # the values that are in row1 but not in row3
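
If a value can appear more than once in a row, sets will treat (1, 1, 2) and (1, 2, 2) as equal. A possible alternative, sketched here with the standard library's collections.Counter, compares the rows as multisets instead. The helper name same_items is hypothetical and not part of pandas.

```python
from collections import Counter

import pandas as pd


def same_items(a, b):
    """True if a and b contain the same values the same number of times, in any order."""
    return Counter(a) == Counter(b)


df1 = pd.DataFrame(
    {"A": [1, 2, 5, 4], "B": [5, 6, 9, 8], "C": [9, 10, 1, 12]},
    index=[1, 2, 3, 4],
)

rows_1_and_3 = same_items(df1.loc[1], df1.loc[3])  # same items, different order
rows_1_and_2 = same_items(df1.loc[1], df1.loc[2])  # different items

# Sets ignore how often a value occurs; Counter does not.
sets_agree = set([1, 1, 2]) == set([1, 2, 2])
counters_agree = same_items([1, 1, 2], [1, 2, 2])

# Counter subtraction keeps what is left over in the first row.
leftover = Counter(df1.loc[1]) - Counter(df1.loc[2])
```

This runs one Python-level comparison per pair of rows, so for a large frame the sorting approach from the accepted answer is likely to be faster.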
