Efficient way in Pandas for removing columns with duplicate values in different columns

Question

I am looking for an efficient and elegant way in Pandas to remove "duplicate" rows in a DataFrame, meaning rows that contain exactly the same set of values but in different columns.

Ideally I want a vectorized way to do this, as the only approaches I can think of so far are very inefficient ones built on pandas.DataFrame.iterrows().

Say my DataFrame is:

| source | target |
|--------|--------|
| 1      | 2      |
| 2      | 1      |
| 4      | 3      |
| 2      | 7      |
| 3      | 4      |

I want it to become:

| source | target |
|--------|--------|
| 1      | 2      |
| 4      | 3      |
| 2      | 7      |

This is a duplicate; many questions ask about this. Take a look at https://stackoverflow.com/questions/51603520/pandas-remove-duplicates-that-exist-in-any-order – rafaelc, Apr 02 '19 at 17:30
This is indeed a duplicate. Your answer is in the link RafaelC provided. Your solution is here: `pd.DataFrame(np.sort(df.values, axis=1), columns=df.columns).drop_duplicates()` – Erfan, Apr 02 '19 at 17:32
Possible duplicate of [Sorting df rows horizontally](https://stackoverflow.com/questions/38884131/sorting-df-rows-horizontally) – Erfan, Apr 02 '19 at 17:33
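
One thing to watch with the one-liner from the comment above: it returns the sorted values rather than the original rows. A small sketch, assuming the sample frame from the question, suggests that the row (4, 3) would come back as (3, 4), which is not quite the output asked for:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"source": [1, 2, 4, 2, 3], "target": [2, 1, 3, 7, 4]})

# Sort within each row, rebuild a frame with the same column names,
# then drop rows that repeat an earlier row.
deduped = pd.DataFrame(np.sort(df.values, axis=1), columns=df.columns).drop_duplicates()
print(deduped)
```

If the original left-to-right order of each kept row matters, filtering the original frame with a Boolean mask avoids this.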

Akhilesh_IN · Accepted Answer · 2019-04-03T03:17:31.733

2

df = df[~pd.DataFrame(np.sort(df.values, axis=1)).duplicated()]

   source  target
0       1       2
2       4       3
3       2       7

Explanation:

np.sort(df.values, axis=1) sorts the values within each row (that is, across the columns), so the rows (1, 2) and (2, 1) both become [1, 2]:

array([[1, 2],
       [1, 2],
       [3, 4],
       [2, 7],
       [3, 4]], dtype=int64)

Then build a DataFrame from the sorted array, call duplicated() to flag every row that repeats an earlier one, and negate the result with the ~ prefix so that True marks the rows to keep:

~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()

0     True
1    False
2     True
3     True
4    False
dtype: bool

Finally, use this Boolean Series as a mask on the original df to get the final output:

   source  target
0       1       2
2       4       3
3       2       7
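
For reference, here is a self-contained sketch of the steps above, assuming the sample frame from the question. The names sorted_pairs, is_repeat and result are only illustrative. It also converts the mask to a NumPy array, which is an addition to the answer's one-liner: a Boolean Series built from a fresh DataFrame carries a default 0..n-1 index and is aligned by label when used as a mask, so the plain array should be the safer choice when df has some other index.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"source": [1, 2, 4, 2, 3], "target": [2, 1, 3, 7, 4]})

# Sort the values within each row so that (2, 1) and (1, 2) compare equal.
sorted_pairs = np.sort(df.to_numpy(), axis=1)

# True for every row whose sorted pair already appeared in an earlier row.
is_repeat = pd.DataFrame(sorted_pairs).duplicated().to_numpy()

# A NumPy Boolean mask is applied by position, not by index label.
result = df[~is_repeat]
print(result)
```

The original rows are kept as they were, so (4, 3) stays (4, 3) and the surviving index labels are 0, 2 and 3.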

edited Apr 03 '19 at 03:17

answered Apr 02 '19 at 17:55

Akhilesh_IN


1

Hi Akhilesh, while this may be the correct answer, you should add a little insight/explanation of what you have done here to make it a quality answer that helps others understand the root cause of the problem. – nircraft Apr 02 '19 at 20:19
@nircraft thank you for pointing it out. please check updates – Akhilesh_IN Apr 03 '19 at 03:18
