This dataset lists the correlation between the two columns named on the left. If you look at rows 3 and 42, you will find they are the same; only the column order is different, which does not affect the correlation. I want to remove row 42. But the dataset has many such mirrored rows, so I need a general algorithm that removes these equivalent rows and keeps only the unique ones.
2 Answers
0
You could try a self join. Without a code example, it's hard to answer, but something like this maybe:
df.merge(df, left_on="source_column", right_on="destination_column")
You can follow that up with a call to `drop_duplicates`.
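As a rough sketch of how the self join could be used here, the example below joins the frame to itself with the two key columns swapped, so each row is matched with its mirrored counterpart, and then drops the later row of each pair. The sample data is hypothetical, the column names are assumed from the other answer, and comparing row indices instead of calling `drop_duplicates` is just one way to read this suggestion.

```python
import pandas as pd

# Hypothetical sample data; column names are assumed, not taken from the question.
df = pd.DataFrame({
    "source_column": ["A", "B", "C", "D"],
    "destination_column": ["B", "A", "E", "F"],
    "correlation_Value": [1, 1, 2, 4],
})

# Self join with the key columns swapped on the right-hand side:
# a row (A, B) is matched with its mirror (B, A) when one exists.
indexed = df.reset_index()
mirrored = indexed.merge(
    indexed,
    left_on=["source_column", "destination_column"],
    right_on=["destination_column", "source_column"],
    suffixes=("", "_mirror"),
)

# Of each mirrored pair, drop the row that appears later in the frame.
to_drop = mirrored.loc[mirrored["index"] > mirrored["index_mirror"], "index"]
out = df.drop(index=to_drop)
print(out)
```

This only handles mirrored pairs; rows that repeat in the same order would still need a plain `drop_duplicates` call.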

suvayu
-
This question might make it easier: https://stackoverflow.com/questions/32093829/remove-duplicates-from-dataframe-based-on-two-columns-a-b-keeping-row-with-max. The difference between that question and mine is that in my dataframe the rows are not identical, only similar. – Dijkstra Algorithm Jul 14 '21 at 12:37
-
@DijkstraAlgorithm instead of posting a screenshot, please post your data as text, and code for your approach. You cannot expect others to do the work for you. See the [guideline](https://stackoverflow.com/help/how-to-ask), particularly the section "Help others reproduce the problem". Also, did you look at the documentation for `drop_duplicates` I pointed to? It allows you to "ignore" certain columns. – suvayu Jul 14 '21 at 13:00
0
Since `correlation_Value` appears to be the same for both orderings, the operation should be commutative, so whatever the value, you only need to look at the first two columns. Sort each pair into a tuple and remove the duplicates:
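# Build an order-independent key per row: (A, B) and (B, A) both become ('A', 'B').
# frozenset(x) should also work as the key in place of tuple(sorted(x)).
key = df[['source_column', 'destination_column']] \
        .apply(lambda x: tuple(sorted(x)), axis='columns')
# Keep only the first row seen for each unordered pair
out = df.loc[~key.duplicated()]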
>>> out
  source_column destination_column  correlation_Value
0             A                  B                  1
2             C                  E                  2
3             D                  F                  4
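For a larger frame, a row-wise `apply` can be slow. The sketch below is a possible vectorized variant of the same idea: it sorts the two labels within each row using NumPy and then flags duplicates on the sorted pairs. The sample data is hypothetical and was chosen only so that the surviving rows match the output shown above; the helper column names `first` and `second` are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two mirrored pairs: (A, B)/(B, A) and (D, F)/(F, D).
df = pd.DataFrame({
    "source_column": ["A", "B", "C", "D", "F"],
    "destination_column": ["B", "A", "E", "F", "D"],
    "correlation_Value": [1, 1, 2, 4, 4],
})

pair_cols = ["source_column", "destination_column"]

# Sort the two labels within each row so (A, B) and (B, A) become identical.
sorted_pairs = np.sort(df[pair_cols].to_numpy(), axis=1)

# 'first' and 'second' are placeholder names for the sorted pair.
key = pd.DataFrame(sorted_pairs, index=df.index, columns=["first", "second"])

# Keep the first occurrence of each unordered pair.
out = df.loc[~key.duplicated()]
print(out)
```

Sorting within the row assumes both columns hold values of a comparable type, such as all strings.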

Corralien