-1

I want to compare two data frames and check if there are duplicated rows. We assume that the order of columns doesn't matter so if df1 looks like that:

 V2 V3
 71 78
 90 13
 12 67
 56 32

and df2 like that:

V2 V3
89 45
77 88
78 71
90 13

Then the non duplicated rows from both df will be:

12 67
56 32
89 45
77 88

How can I achieve this goal in easy way?

smci
  • 32,567
  • 20
  • 113
  • 146
Jagoda
  • 424
  • 1
  • 5
  • 18
  • 1
    Do you mean a data.table or data.frame object? These are two objects while same in function, they differ in functionality. – Roman Luštrik May 07 '15 at 12:48
  • I meant data frame, sorry – Jagoda May 07 '15 at 12:51
  • What does *"check if there are duplicated rows"* mean you want as the output format? Just the count of duplicated rows? A boolean vector showing which rows are duplicated? The actual duplicated rows? The nonduplicated rows? – smci May 07 '15 at 12:51
  • I want another df with non duplicated rows, as in my example. – Jagoda May 07 '15 at 12:53
  • So the order of the columns doesn't matter? Row 1 in df1 is the same as row 3 in df2? – Roman Luštrik May 07 '15 at 12:53
  • I edited it once again as V1s were just row ids and now order doesn't matter – Jagoda May 07 '15 at 12:55
  • 1
    A brute-force way is we'd need to compare df1 to all possible col-wise permutations of df2, mark and filter out duplicate rows. If df2 has C columns, that's C! permutations. – smci May 07 '15 at 12:58
  • 1
    And if your actual data.frame has more than two columns, see this answer: http://stackoverflow.com/questions/9028369/removing-duplicate-combinations-in-r-irrespective-of-order – Sam Firke May 07 '15 at 13:02

3 Answers3

3

You could try

df3 <- rbind(df1, df2)
df4 <- df3
df3[] <-  cbind(do.call(pmax, df3), do.call(pmin, df3))

df4[!(duplicated(df3)|duplicated(df3, fromLast=TRUE)),]
#  V2 V3
#3 12 67
#4 56 32
#5 89 45
#6 77 88
akrun
  • 874,273
  • 37
  • 540
  • 662
  • what effect do the empty brackets in line 3 achieve as opposed to leaving it out? I've never seen that. – Pierre L May 07 '15 at 14:40
  • 1
    @plafort It is just to preserve the original structure of `df3` as a data.frame as the `cbind` step will output a matrix. – akrun May 07 '15 at 14:42
1

The solution provided below works for your example data. This approach may be inefficient for rather large dataset. Then again, computer time is cheap. :)

df1 <- read.table(text = " V2 V3
 71 78
 90 13
 12 67
 56 32", header = TRUE)

df2 <- read.table(text = "V2 V3
89 45
77 88
78 71
90 13", header = TRUE)

throwoutFunction <- function(x, ca) {
  find.duplicates <- apply(ca, MARGIN = 1, FUN = function(y, x) y %in% x, x = x)
  filter.duplicates <- apply(find.duplicates, MARGIN = 2, all)
  if (any(filter.duplicates)) {
    return(data.frame(V2 = NA, V3 = NA))
  } else {
    data.frame(V2 = x[1], V3 = x[2])
  }
}
out1 <- do.call("rbind", apply(df1, MARGIN = 1, FUN = throwoutFunction, ca = df2))

out2 <- do.call("rbind", apply(df2, MARGIN = 1, FUN = throwoutFunction, ca = df1))

out <- na.omit(rbind(out1, out2))
rownames(out) <- 1:nrow(out)
out

  V2 V3
1 12 67
2 56 32
3 89 45
4 77 88
Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
1

Here's a dplyr solution which will probably be pretty quick on larger datasets

df1 <- data_frame( v1 = c(71,90,12,56), v2 = c(78,13,67,32))
df2 <- data_frame( v1 = c(89,77,78,90), v2 = c(45,88,71,13) )

df3 <- bind_rows(df1, df2)

df3 %>%
  rowwise() %>% 
  mutate(key = paste0( min(v1, v2), max(v1, v2))) %>% 
  group_by(key) %>% 
  mutate( size = n()) %>% 
  filter( size == 1)

This solution only works for two grouping variables, to extend it to multiple variables you basically just need to adjust how to manufacture the key.

Edit: I misunderstood the problem as per comments below.

Shorpy
  • 1,549
  • 13
  • 28
  • Could you show the output using the OP's dataset. I am not getting the expected output as showed by OP using this method. `df3 %>% group_by(V3, V2) %>% mutate( size = n()) %>% filter( size == 1)` where `df3 <- bind_rows(df1, df2)` – akrun May 07 '15 at 15:28
  • Oh, I'm sorry I misunderstood the problem. I didn't realize that the columns should be reversible. Let me think for a sec. – Shorpy May 07 '15 at 15:57
  • Okay, I think the edited solution above solves the problem. For more than two comparisson variables, you need to change how to create the key to something like `paste0(sort(row))` – Shorpy May 07 '15 at 16:05