2
1 0 0 0 1
0 0 0 0 0
0 1 0 0 1
1 0 0 0 1
0 0 0 0 0
1 0 0 0 1

I have a dataframe(see above). I need to compare the rows of it to get the matching rows. So for the above df I should get row1=row4=row6 and row2=row5 after comparison. Is there any efficient way to do this row comparison in python.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
tanay
  • 468
  • 5
  • 16

2 Answers2

3

Use:

import pandas as pd


df = pd.DataFrame({0: {0: 1, 1: 0, 2: 0, 3: 1, 4: 0, 5: 1}, 
                   1: {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}, 
                   2: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, 
                   3: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, 
                   4: {0: 1, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1}})
print df
   0  1  2  3  4
0  1  0  0  0  1
1  0  0  0  0  0
2  0  1  0  0  1
3  1  0  0  0  1
4  0  0  0  0  0
5  1  0  0  0  1
#first select only all duplicated rows
df1 = df[df.duplicated(keep=False)]
print df1
   0  1  2  3  4
0  1  0  0  0  1
1  0  0  0  0  0
3  1  0  0  0  1
4  0  0  0  0  0
5  1  0  0  0  1

#sort values by all columns
df2 = df1.sort_values(by=df.columns.tolist())
print df2
   0  1  2  3  4
1  0  0  0  0  0
4  0  0  0  0  0
0  1  0  0  0  1
3  1  0  0  0  1
5  1  0  0  0  1

#find groups
print (~((df2 == df2.shift(1)).all(1))).cumsum()
1    1
4    1
0    2
3    2
5    2
dtype: int32
#print groups    
for i, g in df.groupby((~((df2 == df2.shift(1)).all(1))).cumsum()):
    print g

   0  1  2  3  4
1  0  0  0  0  0
4  0  0  0  0  0
   0  1  2  3  4
0  1  0  0  0  1
3  1  0  0  0  1
5  1  0  0  0  1

#dict comprehension for storing groups
dfs = {i-1: g for i,g in df.groupby((~((df2 == df2.shift(1)).all(1))).cumsum())}
print dfs
{0.0:    0  1  2  3  4
1  0  0  0  0  0
4  0  0  0  0  0, 1.0:    0  1  2  3  4
0  1  0  0  0  1
3  1  0  0  0  1
5  1  0  0  0  1}

print dfs[0]
   0  1  2  3  4
1  0  0  0  0  0
4  0  0  0  0  0

print dfs[1]
   0  1  2  3  4
0  1  0  0  0  1
3  1  0  0  0  1
5  1  0  0  0  1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

Here's how I would think to do it.

import pandas as pd


df = pd.DataFrame({0: {0: 1, 1: 0, 2: 0, 3: 1, 4: 0, 5: 1}, 
                   1: {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}, 
                   2: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, 
                   3: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, 
                   4: {0: 1, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1}})

groups = df.groupby(df.columns.tolist())
df.loc[:, 'group_num'] = None


for num, group in enumerate(groups):
    df.loc[group[1].index, 'group_num'] = num

Yields...

   0  1  2  3  4 group_num
0  1  0  0  0  1         2
1  0  0  0  0  0         0
2  0  1  0  0  1         1
3  1  0  0  0  1         2
4  0  0  0  0  0         0
5  1  0  0  0  1         2

Why group[1] on the last line?

Because you're iterating through a tuple of the form (group_name, group_table). group[1] accesses the actual grouped DataFrame.

Bruce Pucci
  • 1,821
  • 2
  • 19
  • 26