4

I am trying to "merge" column V of a Data Frame in another one if the columns X and Y are equals (I have to match dOne.X == dTwo.X & dOne.Y == dTwo.Y and also dOne.X == dTwo.Y & dOne.Y == dTwo.X) I solved this using a for loop, but it is slow when the Data Frame dOne is big (in my machine it takes 25 minutes if length(dOne.X) == 500000). I would like to know if there is a way to solve this problem using a faster "vectorized" operation. Above is an exemple of what I want to do:

Data Frame ONE
X Y  V
a b  2
a c  3
a d  0
a e  0
b c  2
b d  3
b e  0
c d  2
c e  0
d e  0

Data Frame TWO
X Y  V
a b  1
a c  1
a d  1
b c  1
b d  1
c d  1
e d  1

Expected Data Frame after the columns are merged
X Y  V V2
a b  2  1
a c  3  1
a d  0  1
a e  0  0
b c  2  1
b d  3  1
b e  0  0
c d  2  1
c e  0  0
d e  0  1

This is the code I am using so far that is slow when dOne is big (hundreds of thousands or rows):

copyadjlistValueColumn <- function(dOne, dTwo) {
    dOne$V2 <- 0

    lv <- union(levels(dOne$Y), levels(dOne$X))

    dTwo$X <- factor(dTwo$X, levels = lv)
    dTwo$Y <- factor(dTwo$Y, levels = lv)
    dOne$X <- factor(dOne$X, levels = lv)
    dOne$Y <- factor(dOne$Y, levels = lv)

    for(i in 1:nrow(dTwo)) {
      row <- dTwo[i,]
      dOne$V2[dOne$X == row$X & dOne$Y == row$Y] <- row$V
      dOne$V2[dOne$X == row$Y & dOne$Y == row$X] <- row$V
    }
    dOne
}

This is a testthat test case that covers what I am expecting (using the data frames above):

test_that("Copy V column to another Data Frame", {
    dfOne <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                        Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                        V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0))

    dfTwo <- data.frame(X=c("a", "a", "a", "b", "b", "c", "e"),
                        Y=c("b", "c", "d", "c", "d", "d", "d"),
                        V=c(1, 1, 1, 1, 1, 1, 1))

    lv <- union(levels(dfTwo$Y), levels(dfTwo$X))
    dfExpected <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                             Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                             V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0),
                             V2=c(1, 1, 1, 0, 1, 1, 0, 1, 0, 1))
    dfExpected$X <- factor(dfExpected$X, levels = lv)
    dfExpected$Y <- factor(dfExpected$Y, levels = lv)

    dfMerged <- copyadjlistValueColumn(dfOne, dfTwo)

    expect_identical(dfMerged, dfExpected)
})

Any suggestion?

Thanks a lot :)

Henrik
  • 65,555
  • 14
  • 143
  • 159
alfakini
  • 4,635
  • 2
  • 26
  • 35
  • possible duplicate of [How to join data frames in R (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right) – talat Nov 24 '14 at 13:17
  • What about `merge(dOne, dTwo, by = c("X", "Y"), all.x = TRUE)`? Though it is not exactly match your desired output for some reason – David Arenburg Nov 24 '14 at 13:19
  • 1
    Hey David, I think it is because I have to match it in a "bidirectional" way: `dOne.X == dTwo.X & dOne.Y == dTwo.Y` and also `dOne.X == dTwo.Y & dOne.Y == dTwo.X` – alfakini Nov 24 '14 at 13:21
  • 2
    This is *NOT* a duplicate of the linked question, so unless you have an alternative link, don't close this question – David Arenburg Nov 24 '14 at 15:27

3 Answers3

2

Try two merge, where order of matching columns is reversed in the second, to get the 'bidirectional' matching. Then you may use e.g. rowSums to collapse the two created columns to one.

d1 <- merge(dfOne, dfTwo, by.x = c("X", "Y"), by.y = c("X", "Y"), all.x = TRUE)
d2 <- merge(d1, dfTwo, by.x = c("X", "Y"), by.y = c("Y", "X"), all.x = TRUE)
cbind(dfOne, V2 = rowSums(cbind(d2$V.y, d2$V), na.rm = TRUE))


#    X Y V V2
# 1  a b 2  1
# 2  a c 3  1
# 3  a d 0  1
# 4  a e 0  0
# 5  b c 2  1
# 6  b d 3  1
# 7  b e 0  0
# 8  c d 2  1
# 9  c e 0  0
# 10 d e 0  1

For faster alternatives to merge, check data.table and dplyr alternatives here: stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right/

Henrik
  • 65,555
  • 14
  • 143
  • 159
2

Here's a possible data.table package approach. This approach should be particularly efficient for a big data set like you have:

First convert to data.table object and add keys

library(data.table)
setkey(setDT(dfOne), X, Y)
setkey(setDT(dfTwo), X, Y)

Then perform a join on X & Y combination - the join is performed by matching key columns X,Y of dfOne with key columns X,Y of dfTwo respectively.

dfOne[dfTwo, V2 := i.V]

Now perform a join on Y & X combination - the join is performed by matching key columns X,Y of dfOne with key columns Y,X of dfTwo respectively.

setkey(dfTwo, Y, X)
dfOne[dfTwo, V2 := i.V][]

Result (I'll keep the unmatched as NAs instead of zeroes as it makes more sense this way):

#     X Y V V2
#  1: a b 2  1
#  2: a c 3  1
#  3: a d 0  1
#  4: a e 0 NA
#  5: b c 2  1
#  6: b d 3  1
#  7: b e 0 NA
#  8: c d 2  1
#  9: c e 0 NA
# 10: d e 0  1
Arun
  • 116,683
  • 26
  • 284
  • 387
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
2

With dplyr:

library(dplyr)

left_join(dfOne, dfTwo, by = c("X", "Y")) %>% 
  left_join(dfTwo, by = c("X" = "Y", "Y" = "X")) %>% 
  mutate(V2 = ifelse(is.na(V.y), V, V.y)) %>% 
  select(X, Y, V = V.x, V2) %>% 
  do(replace(., is.na(.), 0))
junkka
  • 543
  • 7
  • 11