Drop duplicates on multiple columns irrespective of the order (a/b == b/a)

Question

Is there any way to delete rows of duplicated pairs in pandas without taking the order into account?

DataFrame before deleting (I want to delete the duplicated pair, highlighted in yellow):

After deleting the duplicates:

example data:

import pandas as pd

df = pd.DataFrame({'a': [1,2,1,1,2,2],
                   'b': [2,1,3,4,3,4]
                  })

You're welcome. I thought this was a duplicate but I can't find one, so I provided an answer – mozway, Jan 07 '22 at 13:16
@mozway - https://stackoverflow.com/questions/55480504/efficient-way-in-pandas-for-removing-columns-with-duplicate-values-in-different – jezrael, Jan 07 '22 at 13:23
Thanks @jezrael. Given the low activity and non-ideal answer of the dupe, I'm not sure whether I should close here and post there, or leave it as it is – mozway, Jan 07 '22 at 13:24
@mozway - your solution is good for small data; for large data the dupe's solution is better. – jezrael, Jan 07 '22 at 13:26
I would say the other way around, sorting is more expensive than creating a set – mozway, Jan 07 '22 at 13:26
@mozway - yes, it depends on the data. If I remember well, about a year ago jpp did some tests of `frozenset`s vs `numpy.sort`. – jezrael, Jan 07 '22 at 13:31

mozway · Answer 1 · 2022-01-07T13:28:44.707

5

You can build a frozenset from each row to get a common, order-independent key to group by, then take the first row per group:

df.groupby(df.apply(frozenset, axis=1), as_index=False).first()

Or use `duplicated` on the Series of frozensets:

df[~df.apply(frozenset, axis=1).duplicated()]

output:
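For reference, here is a minimal end-to-end sketch of the `duplicated` approach on the question's sample data, next to the `numpy.sort` variant mentioned in the comments. The variable names (`keys`, `result`, `sorted_pairs`, `result_sorted`) are illustrative and not from the original posts, and no claim is made here about which variant is faster.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 1, 2, 2],
                   'b': [2, 1, 3, 4, 3, 4]})

# Variant 1: one frozenset per row, so (1, 2) and (2, 1) give the same key
keys = df.apply(frozenset, axis=1)
# Keep only the first occurrence of each unordered pair
result = df[~keys.duplicated()]
print(result)
#    a  b
# 0  1  2
# 2  1  3
# 3  1  4
# 4  2  3
# 5  2  4

# Variant 2: sort the values within each row, then flag repeated rows
sorted_pairs = pd.DataFrame(np.sort(df[['a', 'b']].to_numpy(), axis=1),
                            index=df.index)
result_sorted = df[~sorted_pairs.duplicated()]
```

Both variants keep the original values and index of the first row in each pair. One caveat if you extend this beyond two columns: a frozenset discards repeated values, so rows such as (1, 1, 2) and (1, 2, 2) would map to the same key, whereas the row-sorting variant keeps them distinct.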

edited Jan 07 '22 at 13:28

answered Jan 07 '22 at 13:15

`df[~df.apply(frozenset, axis=1).duplicated()]`? – Corralien Jan 07 '22 at 13:27
