Remove repeating data in pandas

Question

I have 4 columns of data A, B, C, D. Some data are repeating such as row 1: P2 XX P6 XX is repeating in row 5: P6 XX P2 XX. Can anyone help me to remove the repeating units from Pandas dataframe?

A   B   C   D
P2  XX  P6  XX
P3  XX  P5  XX
P5  XX  P8  XX
P5  XX  P3  XX
P6  XX  P2  XX
P8  XX  P5  XX
P1  LU  P2  LU
P2  LU  P1  LU
P3  LU  P9  LU
P3  LU  P6  LU
P6  LU  P3  LU
P9  LU  P3  LU

Output:

A  B  C  D 
P2 XX P6 XX 
P3 XX P5 XX 
P5 XX P8 XX 
P1 LU P2 LU 
P3 LU P9 LU 
P3 LU P6 LU

score 1 · Answer 1 · answered Jul 24 '21 at 23:42

Assuming it's okay to swap columns A and C, you can use np.minimum and np.maximum to swap the two columns and then drop duplicates:

import numpy as np
df.A, df.C = np.minimum(df.A, df.C), np.maximum(df.A, df.C)

df.drop_duplicates()
    A   B   C   D
0  P2  XX  P6  XX
1  P3  XX  P5  XX
2  P5  XX  P8  XX
6  P1  LU  P2  LU
8  P3  LU  P9  LU
9  P3  LU  P6  LU

score 1 · Accepted Answer · answered Jul 24 '21 at 23:44

1

We can use np.sort on axis=1 sort sort values in rows, then drop_duplicates on the sorted frame. Lastly, use the index to filter df:

import numpy as np


idx = (
    pd.DataFrame(
        np.sort(df.values, axis=1), columns=df.columns
    ).drop_duplicates().index
)

df = df.loc[idx]

Or without a second variable:

df = df.loc[
    pd.DataFrame(
        np.sort(df.values, axis=1), columns=df.columns
    ).drop_duplicates().index
]

df:

    A   B   C   D
0  P2  XX  P6  XX
1  P3  XX  P5  XX
2  P5  XX  P8  XX
6  P1  LU  P2  LU
8  P3  LU  P9  LU
9  P3  LU  P6  LU

answered Jul 24 '21 at 23:44

Henry Ecker

34,399
18
41
57

And if considering my data set has Input: A B C D P1 XX P1 LU P1 XX P9 LU P1 XX P3 LU P10 XX P10 LU P2 XX P8 LU P4 XX P1 LU P6 XX P8 LU P7 XX P10 LU P9 XX P3 LU P1 LU P4 XX P1 LU P1 XX P10 LU P7 XX P10 LU P10 XX P3 LU P1 XX P3 LU P9 XX P8 LU P2 XX P8 LU P6 XX P9 LU P1 XX – Akash Deep Biswas Jul 25 '21 at 00:07

Remove repeating data in pandas

2 Answers2