0

I have 4 columns of data A, B, C, D. Some data are repeating such as row 1: P2 XX P6 XX is repeating in row 5: P6 XX P2 XX. Can anyone help me to remove the repeating units from Pandas dataframe?

A   B   C   D
P2  XX  P6  XX
P3  XX  P5  XX
P5  XX  P8  XX
P5  XX  P3  XX
P6  XX  P2  XX
P8  XX  P5  XX
P1  LU  P2  LU
P2  LU  P1  LU
P3  LU  P9  LU
P3  LU  P6  LU
P6  LU  P3  LU
P9  LU  P3  LU

Output:

A  B  C  D 
P2 XX P6 XX 
P3 XX P5 XX 
P5 XX P8 XX 
P1 LU P2 LU 
P3 LU P9 LU 
P3 LU P6 LU
Machavity
  • 30,841
  • 27
  • 92
  • 100

2 Answers2

1

Assuming it's okay to swap columns A and C, you can use np.minimum and np.maximum to swap the two columns and then drop duplicates:

import numpy as np
df.A, df.C = np.minimum(df.A, df.C), np.maximum(df.A, df.C)

df.drop_duplicates()
    A   B   C   D
0  P2  XX  P6  XX
1  P3  XX  P5  XX
2  P5  XX  P8  XX
6  P1  LU  P2  LU
8  P3  LU  P9  LU
9  P3  LU  P6  LU
Psidom
  • 209,562
  • 33
  • 339
  • 356
1

We can use np.sort on axis=1 sort sort values in rows, then drop_duplicates on the sorted frame. Lastly, use the index to filter df:

import numpy as np


idx = (
    pd.DataFrame(
        np.sort(df.values, axis=1), columns=df.columns
    ).drop_duplicates().index
)

df = df.loc[idx]

Or without a second variable:

df = df.loc[
    pd.DataFrame(
        np.sort(df.values, axis=1), columns=df.columns
    ).drop_duplicates().index
]

df:

    A   B   C   D
0  P2  XX  P6  XX
1  P3  XX  P5  XX
2  P5  XX  P8  XX
6  P1  LU  P2  LU
8  P3  LU  P9  LU
9  P3  LU  P6  LU
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • And if considering my data set has Input: A B C D P1 XX P1 LU P1 XX P9 LU P1 XX P3 LU P10 XX P10 LU P2 XX P8 LU P4 XX P1 LU P6 XX P8 LU P7 XX P10 LU P9 XX P3 LU P1 LU P4 XX P1 LU P1 XX P10 LU P7 XX P10 LU P10 XX P3 LU P1 XX P3 LU P9 XX P8 LU P2 XX P8 LU P6 XX P9 LU P1 XX – Akash Deep Biswas Jul 25 '21 at 00:07