Dataframe - Remove similar rows based on two columns

Question

I have the following dataset:

This dataset lists the correlation between the two columns on the left. If you look at rows 3 and 42, you will find they are the same; only the column order is different, which does not affect the correlation. I want to remove row 42. But the dataset has many such pairs of similar rows. I need a general algorithm that removes these similar rows and keeps only the unique ones.

score 0 · Answer 1 · answered Jul 14 '21 at 12:34


You could try a self join. Without a code example, it's hard to answer, but something like this maybe:

df.merge(df, left_on="source_column", right_on="destination_column")

You can follow that up with a call to drop_duplicates.
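One way to make `drop_duplicates` applicable here, which is not necessarily what this answer had in mind, is to first add helper columns that hold each pair in a fixed order and then deduplicate on those columns only. The sketch below assumes the column names used elsewhere in this thread; the sample data and the helper names `lo` and `hi` are hypothetical, since the question only showed a screenshot.

```python
import pandas as pd

# Hypothetical sample data; the original question did not include its data as text.
df = pd.DataFrame({
    "source_column": ["A", "B", "C", "D"],
    "destination_column": ["B", "A", "E", "F"],
    "correlation_Value": [1, 1, 2, 4],
})

# Order-independent helper columns: the smaller and the larger label of each pair.
pairs = list(zip(df["source_column"], df["destination_column"]))
normalized = df.assign(
    lo=[min(a, b) for a, b in pairs],
    hi=[max(a, b) for a, b in pairs],
)

# Compare rows on the helper columns only, keep the first row of each pair,
# then drop the helpers again.
deduped = normalized.drop_duplicates(subset=["lo", "hi"]).drop(columns=["lo", "hi"])
print(deduped)
```

The `subset` argument is what lets `drop_duplicates` ignore the remaining columns when comparing rows.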


suvayu


This question might help: https://stackoverflow.com/questions/32093829/remove-duplicates-from-dataframe-based-on-two-columns-a-b-keeping-row-with-max. The difference is that in my dataframe the rows are not identical, only similar – Dijkstra Algorithm Jul 14 '21 at 12:37

@DijkstraAlgorithm instead of posting a screenshot, please post your data as text, and code for your approach. You cannot expect others to do the work for you. See the [guideline](https://stackoverflow.com/help/how-to-ask), particularly the section "Help others reproduce the problem". Also, did you look at the documentation for `drop_duplicates` I pointed to? It allows you to "ignore" certain columns. – suvayu Jul 14 '21 at 13:00

Corralien · Accepted Answer · 2021-07-14T13:08:46.063

Since correlation_Value appears to be the same for both orderings of a pair, the operation is presumably commutative, so whatever the value, you only need to look at the first two columns. Sort each pair into a tuple and remove the duplicates:

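# Build an order-independent key per row: ('A', 'B') and ('B', 'A') both become ('A', 'B').
# A frozenset should also work as the key; a plain set is not hashable.
key = df[['source_column', 'destination_column']] \
          .apply(lambda x: tuple(sorted(x)), axis='columns')

# Keep only the first row seen for each key.
out = df.loc[~key.duplicated()]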

>>> out
  source_column destination_column  correlation_Value
0             A                  B                  1
2             C                  E                  2
3             D                  F                  4
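To try this end to end, here is a minimal self-contained sketch of the same sort-then-deduplicate idea. The sample data is an assumption, chosen only to be consistent with the output shown above, and the function name `drop_mirrored_rows` is hypothetical.

```python
import pandas as pd


def drop_mirrored_rows(frame, first, second):
    """Keep the first row for each unordered (first, second) pair."""
    # Sorting each pair makes ('A', 'B') and ('B', 'A') produce the same key.
    key = frame[[first, second]].apply(lambda row: tuple(sorted(row)), axis="columns")
    # duplicated() marks every repeat of a key after its first occurrence.
    return frame.loc[~key.duplicated()]


# Hypothetical input: row 1 mirrors row 0.
df = pd.DataFrame({
    "source_column": ["A", "B", "C", "D"],
    "destination_column": ["B", "A", "E", "F"],
    "correlation_Value": [1, 1, 2, 4],
})

out = drop_mirrored_rows(df, "source_column", "destination_column")
print(out)
```

Note that the sort relies on the two columns holding mutually comparable values, such as all strings, and that the row-wise `apply` may be slow on very large frames.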
