I have the following dataset (posted as a screenshot, not reproduced here).

Each row of this dataset gives the correlation between the two columns named on the left. If you look at rows 3 and 42, you will find they are the same pair; only the column order differs, which does not affect the correlation. I want to remove row 42. The dataset has many such mirrored rows, so I need a general algorithm that removes them and keeps only the unique pairs.

2 Answers

You could try a self join. Without a code example, it's hard to answer, but something like this maybe:

df.merge(df, left_on="source_column", right_on="destination_column")

You can follow that up with a call to drop_duplicates.
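
One way to flesh out the self-join idea is sketched below. It is an illustration rather than the answerer's exact method: the sample data is hypothetical, and the join matches on both columns swapped, so that each row is paired with its mirror before the later row of each pair is dropped.

```python
import pandas as pd

# Hypothetical sample data; column names follow the ones used in this thread.
df = pd.DataFrame({
    "source_column":      ["A", "B", "C", "D", "F"],
    "destination_column": ["B", "A", "E", "F", "D"],
    "correlation_Value":  [1, 1, 2, 4, 4],
})

# Expose the row labels as a column named "index" so that each row
# can be compared with its mirror after the join.
left = df.reset_index()

# Self join: match (source, destination) against (destination, source).
pairs = left.merge(
    left,
    left_on=["source_column", "destination_column"],
    right_on=["destination_column", "source_column"],
    suffixes=("", "_mirror"),
)

# Of each mirrored pair, drop the row that appears later in the frame.
to_drop = pairs.loc[pairs["index"] > pairs["index_mirror"], "index"]
out = df.drop(index=to_drop)
print(out)
```

Rows without a mirror never appear in `pairs` (the merge is an inner join), so they are kept unchanged.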

suvayu
  • This question might make it easier: https://stackoverflow.com/questions/32093829/remove-duplicates-from-dataframe-based-on-two-columns-a-b-keeping-row-with-max. The difference is that in my DataFrame the rows are not identical, only similar. – Dijkstra Algorithm Jul 14 '21 at 12:37
  • @DijkstraAlgorithm instead of posting a screenshot, please post your data as text, and code for your approach. You cannot expect others to do the work for you. See the [guideline](https://stackoverflow.com/help/how-to-ask), particularly the section "Help others reproduce the problem". Also, did you look at the documentation for `drop_duplicates` I pointed to? It allows you to "ignore" certain columns. – suvayu Jul 14 '21 at 13:00

Since correlation_Value is the same for both orderings, the operation is commutative, so whatever the value, you only need to look at the first two columns. Sort each pair into a tuple and drop the duplicates:

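# tuple(sorted(x)) gives (A, B) and (B, A) the same key;
# frozenset(x) should also work as an order-independent, hashable key
key = df[['source_column', 'destination_column']] \
          .apply(lambda x: tuple(sorted(x)), axis='columns')

# Keep only the first row seen for each unordered pair
out = df.loc[~key.duplicated()]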
>>> out
  source_column destination_column  correlation_Value
0             A                  B                  1
2             C                  E                  2
3             D                  F                  4
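
For larger frames, the row-wise `apply` can be slow. A possible vectorized variant is sketched below: it sorts each pair with NumPy instead. The sample data is hypothetical (the question's data was only posted as an image), and the helper name `normalized` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the ones used in this answer.
df = pd.DataFrame({
    "source_column":      ["A", "B", "C", "D", "F"],
    "destination_column": ["B", "A", "E", "F", "D"],
    "correlation_Value":  [1, 1, 2, 4, 4],
})

pair_cols = ["source_column", "destination_column"]

# Sort the two labels within each row so that (B, A) becomes (A, B).
normalized = pd.DataFrame(
    np.sort(df[pair_cols].to_numpy(), axis=1),
    columns=pair_cols,
    index=df.index,
)

# Keep the first row seen for each unordered pair.
out = df.loc[~normalized.duplicated()]
print(out)
```

This assumes the two label columns hold mutually comparable values (for example, all strings), since `np.sort` has to order them within each row.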
Corralien