Merging two data frames with different sizes by matching their columns

Question

I am trying to "merge" column V of a Data Frame in another one if the columns X and Y are equals (I have to match dOne.X == dTwo.X & dOne.Y == dTwo.Y and also dOne.X == dTwo.Y & dOne.Y == dTwo.X) I solved this using a for loop, but it is slow when the Data Frame dOne is big (in my machine it takes 25 minutes if length(dOne.X) == 500000). I would like to know if there is a way to solve this problem using a faster "vectorized" operation. Above is an exemple of what I want to do:

Data Frame ONE
X Y  V
a b  2
a c  3
a d  0
a e  0
b c  2
b d  3
b e  0
c d  2
c e  0
d e  0

Data Frame TWO
X Y  V
a b  1
a c  1
a d  1
b c  1
b d  1
c d  1
e d  1

Expected Data Frame after the columns are merged
X Y  V V2
a b  2  1
a c  3  1
a d  0  1
a e  0  0
b c  2  1
b d  3  1
b e  0  0
c d  2  1
c e  0  0
d e  0  1

This is the code I am using so far that is slow when dOne is big (hundreds of thousands or rows):

copyadjlistValueColumn <- function(dOne, dTwo) {
    dOne$V2 <- 0

    lv <- union(levels(dOne$Y), levels(dOne$X))

    dTwo$X <- factor(dTwo$X, levels = lv)
    dTwo$Y <- factor(dTwo$Y, levels = lv)
    dOne$X <- factor(dOne$X, levels = lv)
    dOne$Y <- factor(dOne$Y, levels = lv)

    for(i in 1:nrow(dTwo)) {
      row <- dTwo[i,]
      dOne$V2[dOne$X == row$X & dOne$Y == row$Y] <- row$V
      dOne$V2[dOne$X == row$Y & dOne$Y == row$X] <- row$V
    }
    dOne
}

This is a testthat test case that covers what I am expecting (using the data frames above):

test_that("Copy V column to another Data Frame", {
    dfOne <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                        Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                        V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0))

    dfTwo <- data.frame(X=c("a", "a", "a", "b", "b", "c", "e"),
                        Y=c("b", "c", "d", "c", "d", "d", "d"),
                        V=c(1, 1, 1, 1, 1, 1, 1))

    lv <- union(levels(dfTwo$Y), levels(dfTwo$X))
    dfExpected <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                             Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                             V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0),
                             V2=c(1, 1, 1, 0, 1, 1, 0, 1, 0, 1))
    dfExpected$X <- factor(dfExpected$X, levels = lv)
    dfExpected$Y <- factor(dfExpected$Y, levels = lv)

    dfMerged <- copyadjlistValueColumn(dfOne, dfTwo)

    expect_identical(dfMerged, dfExpected)
})

Any suggestion?

Thanks a lot :)

possible duplicate of [How to join data frames in R (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right) — talat, Nov 24 '14 at 13:17
What about `merge(dOne, dTwo, by = c("X", "Y"), all.x = TRUE)`? Though it is not exactly match your desired output for some reason — David Arenburg, Nov 24 '14 at 13:19
Hey David, I think it is because I have to match it in a "bidirectional" way: `dOne.X == dTwo.X & dOne.Y == dTwo.Y` and also `dOne.X == dTwo.Y & dOne.Y == dTwo.X` — alfakini, Nov 24 '14 at 13:21
This is *NOT* a duplicate of the linked question, so unless you have an alternative link, don't close this question — David Arenburg, Nov 24 '14 at 15:27

Henrik · Accepted Answer · 2014-11-25T08:06:28.803

Try two merge, where order of matching columns is reversed in the second, to get the 'bidirectional' matching. Then you may use e.g. rowSums to collapse the two created columns to one.

d1 <- merge(dfOne, dfTwo, by.x = c("X", "Y"), by.y = c("X", "Y"), all.x = TRUE)
d2 <- merge(d1, dfTwo, by.x = c("X", "Y"), by.y = c("Y", "X"), all.x = TRUE)
cbind(dfOne, V2 = rowSums(cbind(d2$V.y, d2$V), na.rm = TRUE))


#    X Y V V2
# 1  a b 2  1
# 2  a c 3  1
# 3  a d 0  1
# 4  a e 0  0
# 5  b c 2  1
# 6  b d 3  1
# 7  b e 0  0
# 8  c d 2  1
# 9  c e 0  0
# 10 d e 0  1

For faster alternatives to merge, check data.table and dplyr alternatives here: stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right/

@alf., Please note that I made a tiny change to get rid of the re`name` in the beginning. — Henrik, Nov 25 '14 at 08:04

score 2 · Answer 2 · edited Nov 24 '14 at 17:55

Here's a possible data.table package approach. This approach should be particularly efficient for a big data set like you have:

First convert to data.table object and add keys

library(data.table)
setkey(setDT(dfOne), X, Y)
setkey(setDT(dfTwo), X, Y)

Then perform a join on X & Y combination - the join is performed by matching key columns X,Y of dfOne with key columns X,Y of dfTwo respectively.

dfOne[dfTwo, V2 := i.V]

Now perform a join on Y & X combination - the join is performed by matching key columns X,Y of dfOne with key columns Y,X of dfTwo respectively.

setkey(dfTwo, Y, X)
dfOne[dfTwo, V2 := i.V][]

Result (I'll keep the unmatched as NAs instead of zeroes as it makes more sense this way):

#     X Y V V2
#  1: a b 2  1
#  2: a c 3  1
#  3: a d 0  1
#  4: a e 0 NA
#  5: b c 2  1
#  6: b d 3  1
#  7: b e 0 NA
#  8: c d 2  1
#  9: c e 0 NA
# 10: d e 0  1

@alf. Have you tried this approach? Didn't get any feedback from you. — David Arenburg, Nov 25 '14 at 08:28

junkka · Answer 3 · 2014-11-24T21:29:50.477

2

With dplyr:

library(dplyr)

left_join(dfOne, dfTwo, by = c("X", "Y")) %>% 
  left_join(dfTwo, by = c("X" = "Y", "Y" = "X")) %>% 
  mutate(V2 = ifelse(is.na(V.y), V, V.y)) %>% 
  select(X, Y, V = V.x, V2) %>% 
  do(replace(., is.na(.), 0))

edited Nov 24 '14 at 21:29

answered Nov 24 '14 at 21:16

junkka

543
7
11

Thanks junkka! I will stick with Henrik answer as it is easier to understand. – alfakini Nov 25 '14 at 01:09

Merging two data frames with different sizes by matching their columns

3 Answers3

Linked