
In the following data frame I want to keep each pair of Var1 and Var2 only once (1 4 and 4 1 are considered the same pair). My idea was to sort Var1 and Var2 within each row and then remove duplicate rows based on both columns. However, I don't get my desired result.

This is what my data looks like:

Var1 <- c(1,2,3,4,5,5)
Var2 <- c(4,3,2,1,5,5)
f <- c("blue","green","yellow","red","orange2","grey")
g <- c("blue","green","yellow","red","orange1","grey")
testdata <- data.frame(Var1,Var2,f,g)

I can sort within the rows, but the values of columns f and g should remain untouched. How do I do this?

library(data.table)
testdata <- t(apply(testdata, 1, function(x) x[order(x)]))  # sorts the whole row, including f and g
testdata <- as.data.table(testdata)
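For reference, running this attempt on the sample data shows what goes wrong: `apply()` coerces each row to a character vector, so f and g get sorted together with Var1 and Var2 (in row 5 the colour columns end up swapped):

```r
Var1 <- c(1, 2, 3, 4, 5, 5)
Var2 <- c(4, 3, 2, 1, 5, 5)
f <- c("blue", "green", "yellow", "red", "orange2", "grey")
g <- c("blue", "green", "yellow", "red", "orange1", "grey")
testdata <- data.frame(Var1, Var2, f, g, stringsAsFactors = FALSE)

# apply() converts the mixed data.frame to a character matrix, so the
# colour columns take part in the within-row sort:
m <- t(apply(testdata, 1, function(x) x[order(x)]))
m[5, ]  # "5" "5" "orange1" "orange2" -- f and g have been swapped
```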

Then, I want to remove duplicate rows based on Var1 and Var2

I want to get this as a result:

Var1 Var2 f       g
1    4    blue    blue
2    3    green   green
5    5    orange2 orange1

Thanks for your help!

qg7el

3 Answers


In case people are interested in solving this using dplyr:

library(dplyr)
testdata %>% 
   rowwise() %>%
   # a separator in the key avoids collisions between multi-digit values
   mutate(key = paste(sort(c(Var1, Var2)), collapse = "-")) %>%
   distinct(key, .keep_all = TRUE) %>%
   select(-key)

# Source: local data frame [3 x 4]
# Groups: <by row>
# 
# # A tibble: 3 × 4
#    Var1  Var2       f       g
#   <dbl> <dbl>  <fctr>  <fctr>
# 1     1     4    blue    blue
# 2     2     3   green   green
# 3     5     5 orange2 orange1
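A vectorized variant (a sketch, assuming Var1 and Var2 are numeric) avoids the per-row overhead of `rowwise()` by using `pmin()`/`pmax()` to put each pair into canonical order inside `distinct()`:

```r
library(dplyr)

# Reconstruct the question's data so the snippet stands alone
Var1 <- c(1, 2, 3, 4, 5, 5)
Var2 <- c(4, 3, 2, 1, 5, 5)
f <- c("blue", "green", "yellow", "red", "orange2", "grey")
g <- c("blue", "green", "yellow", "red", "orange1", "grey")
testdata <- data.frame(Var1, Var2, f, g, stringsAsFactors = FALSE)

result <- testdata %>%
  # pmin/pmax give the smaller and larger of each pair, so (1,4) and
  # (4,1) map to the same (lo, hi) combination
  distinct(lo = pmin(Var1, Var2), hi = pmax(Var1, Var2), .keep_all = TRUE) %>%
  select(-lo, -hi)
result  # three rows: the (1,4), (2,3) and (5,5) pairs
```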
fmic_
  • How does this work? This is really cool. I thought paste works by pasting the two columns in that order, yet my column pairs have a key that is the same for each duplicate pair. – Adam Mar 04 '19 at 17:42
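To the comment above: `sort()` puts each pair into a canonical order before pasting, so both orderings of a pair produce the same key. One caveat: with `collapse = ""` distinct pairs of multi-digit values can collide, so a separator is safer:

```r
paste(sort(c(1, 4)), collapse = "")    # "14"
paste(sort(c(4, 1)), collapse = "")    # "14" -- same pair, same key

# Without a separator, different pairs can produce the same key:
paste(sort(c(11, 11)), collapse = "")  # "1111"
paste(sort(c(1, 111)), collapse = "")  # "1111" -- different pair, same key!

# A separator disambiguates:
paste(sort(c(1, 111)), collapse = "-") # "1-111"
```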

Instead of sorting the whole dataset, sort only 'Var1' and 'Var2', then use duplicated to remove the duplicate rows:

testdata[1:2] <- t(apply(testdata[1:2], 1, sort))
testdata[!duplicated(testdata[1:2]),]
#   Var1 Var2       f       g
#1    1    4    blue    blue
#2    2    3   green   green
#5    5    5 orange2 orange1
akrun

If the data are large, as in Sort large amount of data and save repeated pairs of values in R, using apply() on each row will be expensive. Instead, create the set of unique values

uid = unique(unlist(testdata[c("Var1", "Var2")], use.names=FALSE))

determine whether a swap is necessary

swap = match(testdata[["Var1"]], uid) > match(testdata[["Var2"]], uid)

and update

tmp = testdata[swap, "Var1"]
testdata[swap, "Var1"] = testdata[swap, "Var2"]
testdata[swap, "Var2"] = tmp

remove duplicates as before

testdata[!duplicated(testdata[1:2]),]

If there were many additional columns, and copying these were expensive, a more self-contained solution would be

uid = unique(unlist(testdata[c("Var1", "Var2")], use.names=FALSE))
swap = match(testdata[["Var1"]], uid) > match(testdata[["Var2"]], uid)
idx = !duplicated(data.frame(
    V1 = ifelse(swap, testdata[["Var2"]], testdata[["Var1"]]),
    V2 = ifelse(swap, testdata[["Var1"]], testdata[["Var2"]])))
testdata[idx, , drop=FALSE]
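As a quick check, running the self-contained version on the question's data reproduces the expected three rows (testdata is rebuilt here so the snippet stands alone):

```r
# Reconstruct the question's data
Var1 <- c(1, 2, 3, 4, 5, 5)
Var2 <- c(4, 3, 2, 1, 5, 5)
f <- c("blue", "green", "yellow", "red", "orange2", "grey")
g <- c("blue", "green", "yellow", "red", "orange1", "grey")
testdata <- data.frame(Var1, Var2, f, g, stringsAsFactors = FALSE)

# Canonical ordering via match() against the unique values, then dedup
uid <- unique(unlist(testdata[c("Var1", "Var2")], use.names = FALSE))
swap <- match(testdata[["Var1"]], uid) > match(testdata[["Var2"]], uid)
idx <- !duplicated(data.frame(
    V1 = ifelse(swap, testdata[["Var2"]], testdata[["Var1"]]),
    V2 = ifelse(swap, testdata[["Var1"]], testdata[["Var2"]])))
result <- testdata[idx, , drop = FALSE]
result
#   Var1 Var2       f       g
# 1    1    4    blue    blue
# 2    2    3   green   green
# 5    5    5 orange2 orange1
```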
Martin Morgan