Imagine I have a data frame with three columns, where column1 and column2 together identify a combination that has a certain output 'value'. I want to filter out rows in which the two columns are merely swapped, since the outcome is the same, and keep only one row per combination.

e.g. the pair (2, 1) and the pair (1, 2) both have value 1, so they are effectively the same row.

df <- data.frame(column1 = c(2,3,4,1,3,4,1,2,4), 
                 column2 = c(1,1,1,2,2,2,3,3,3), 
                 value = c(1,2,10,1,2,4,2,2,5))

I don't have any working code for this yet, so I'd appreciate any help or hints!

Ronak Shah
c.k
  • Can you show your expected output for the given dataframe? – Ronak Shah Dec 07 '19 at 01:38
  • out <- data.frame(column1 = c(2,3,4,3,4,4), column2 = c(1,1,1,2,2,3), value = c(1,2,10,2,4,5)) – c.k Dec 07 '19 at 01:53
  • In fact, there are > 10k pairs in the data frame where column1 and column2 are swapped and the output is the same (it is derived from a Hamming distance calculation). Since 1 : 2 == 2 : 1, I want to filter out all of those duplicated entries with the same output! Thanks – c.k Dec 07 '19 at 01:54

1 Answer

You can use pmin and pmax to put each pair in a consistent order (larger value first, smaller second) and then keep only the unique rows.

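library(dplyr)
df %>%
  # Row-wise max and min give an order-independent key for each pair
  mutate(temp1 = pmax(column1, column2), 
         temp2 = pmin(column1, column2)) %>%
  select(temp1, temp2, value) %>%
  # Swapped pairs now look identical, so distinct() drops the repeats
  distinct()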

#  temp1 temp2 value
#1     2     1     1
#2     3     1     2
#3     4     1    10
#4     3     2     2
#5     4     2     4
#6     4     3     5
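
If you would rather keep the original column names and avoid the dplyr dependency, the same idea can be sketched in base R with duplicated(). This is an alternative sketch, not part of the approach above; the helper names hi, lo, keep and out are chosen here only for illustration.

```r
df <- data.frame(column1 = c(2,3,4,1,3,4,1,2,4),
                 column2 = c(1,1,1,2,2,2,3,3,3),
                 value   = c(1,2,10,1,2,4,2,2,5))

# Order-independent key for each pair: larger value first, smaller second
hi <- pmax(df$column1, df$column2)
lo <- pmin(df$column1, df$column2)

# Flag the first occurrence of every (hi, lo, value) combination
keep <- !duplicated(data.frame(hi, lo, value = df$value))

# Subset the original rows, so column1 and column2 keep their names
out <- df[keep, ]
```

Because duplicated() marks only the later repeats, the first row seen for each pair is the one that survives. Including value in the key means two swapped rows are collapsed only when their values also agree.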
Ronak Shah