
I am trying to clean my pandas DataFrame of duplicate rows. I know how to remove rows where the column values are identical, but I'm not sure how to do it when column A on row 1 equals column B on row 2 and column B on row 1 equals column A on row 2. I hope that is not too confusing. I've added an example of a table below; I would consider rows 2 and 3 to be duplicates. How would I remove them using pandas?

[table example]
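Something along these lines, with made-up values since the actual table is in the image above; the second and third rows are the ones I consider swapped duplicates:

import pandas as pd

# hypothetical values, only to illustrate the question:
# rows 1 and 2 are swapped duplicates of each other
df = pd.DataFrame({"A": [50, 10, 15, 20],
                   "B": [40, 15, 10, 30]})
print(df)
#     A   B
# 0  50  40
# 1  10  15
# 2  15  10
# 3  20  30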

Edit:

Duplicate rows are not necessarily right above or below each other. I need to keep only one of those rows (doesn't matter which one specifically).

1 Answer


Use np.sort to put each row's values in the same order, so that swapped duplicates become identical rows and can be detected with duplicated():

import pandas as pd
import numpy as np

# toy data
df = pd.DataFrame(data=[[10, 15], [15, 10]], columns=["A", "B"])

# find duplicate rows
duplicated = pd.DataFrame(np.sort(df[["A", "B"]], axis=1), index=df.index).duplicated()

# filter them out, keeping the first occurrence
res = df[~duplicated]
print(res)

Output

    A   B
0  10  15
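Since the question says it doesn't matter which of the two rows survives, note that duplicated() also takes a keep argument; keep="last" marks the earlier occurrence instead. Continuing from the snippet above:

# keep="last" flags the first occurrence as the duplicate, so row 1 survives instead
duplicated_last = pd.DataFrame(np.sort(df[["A", "B"]], axis=1), index=df.index).duplicated(keep="last")
print(df[~duplicated_last])
#     A   B
# 1  15  10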

Alternatively, use frozenset to convert each row into a hashable set where order does not matter:

# find duplicate rows
duplicated = df[["A", "B"]].apply(frozenset, axis=1).duplicated()

# filter them out
res = df[~duplicated]
print(res)
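Both approaches build the same boolean mask. As a quick sanity check on a slightly larger, made-up frame where the swapped duplicates are not next to each other (as in the edit to the question), reusing the imports from above:

# made-up frame; rows 0 and 2 are swapped duplicates and not adjacent
df2 = pd.DataFrame(data=[[10, 15], [20, 30], [15, 10]], columns=["A", "B"])
mask = df2[["A", "B"]].apply(frozenset, axis=1).duplicated()
print(df2[~mask])
#     A   B
# 0  10  15
# 1  20  30

On large frames the np.sort version is usually faster, since apply with frozenset loops over the rows in Python.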
Dani Mesejo