I could do directly a left join and then ifelse condition based on if the original col3 is missing like follows:
DT2new <- merge(DT2, DT1, by = c("col1", "col2"), all.x = T)
DT2new[, col3 := ifelse(is.na(col3.x), col3.y, col3.x)]
DT2new <- DT2new[, .(col1, col2, col3)]
# col1 col2 col3
#1: a a 4
#2: b a 6
#3: b d 2
#4: c a 8
#5: c c 2
#6: d c 3
#7: e b 1
#8: e d NA
Alternatively, a more efficient way is to do the operation by reference, which directly modifies (in place by reference) the DT2 data.table :
DT2[DT1, on = .(col1, col2), col3 := i.col3]
# col1 col2 col3
#1: b d 2
#2: e b 1
#3: c c 2
#4: e d NA
#5: b a 6
#6: c a 8
#7: d c 3
#8: a a 4
The i
in i.col3
refers to the i as in DT[i, j, by]
, hence to the column col3
in DT1
. This works because you do not have any cases when a non missing match for columns col1 and col2 in both data.tables would occur. If you have such cases you could do a more generic operation as follows (including examples of data.tables DT1 and DT2):
DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c", "e"),
col2 = c("b","d","c","a","d","a","c","a", "b"),
col3 = c(1,2,3,4,5,6,7,8, 22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
col2 = c("d","b","c","d","a","a","c","a"),
col3 = c(NA,1,2,NA,6,NA,3,NA))
You can see that DT1
has for col1 = "e"
and col2 = "b"
a value of 22. The DT2
has for col1 = "e"
and col2 = "b"
a value of 1. In order to give preference to DT2
if such clash occurs you would do:
DT2[DT1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]
Which gives you
# col1 col2 col3
#1: b d 2
#2: e b 1
#3: c c 2
#4: e d NA
#5: b a 6
#6: c a 8
#7: d c 3
#8: a a 4
The x.
in x.col3
refers to col3
in DT2
.
EDIT: A more efficient generic method is the vectorised data.table operation approach (in case that DT1 contains values for a certain pair of col1 and col2 on its own)
Given the usefull comments (@Frank & @chinsoon12), I double checked on the provided solution. As pointed out, the ifelse can get slow (reasons in the comment) and that is why the vectorised solution is a better way to go:
dt1 <- data.table(col1 = paste0(rep(letters[1:26], times = 10000), rep(c(1:10000), each = 26)),
col2 = paste0(rep(letters[26:1], times = 10000), rep(c(1:10000), each = 26)),
col3 = rep(c(3,1,2,4,6,5,3,6,2,NA), times = 26000))
dt2 <- data.table(col1 = paste0(rep(letters[1:26], times = 10000), rep(c(1:10000), each = 26)),
col2 = paste0(rep(letters[26:1], times = 10000), rep(c(1:10000), each = 26)),
col3 = rep(c(10,NA,22,51,60,NA,32,NA,28,NA), times = 26000))
dt2alt <- copy(dt2)
microbenchmark::microbenchmark(
NoIfelse = dt2[is.na(col3), col3 := dt1[dt2[is.na(col3)], on = .(col1, col2), col3]],
Ifelse = dt2alt[dt1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]
)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# NoIfelse 94.06635 100.2841 107.4140 103.7807 107.4006 154.7868 100 a
# Ifelse 219.94923 243.2651 254.3271 250.6967 266.8300 306.7213 100 b
identical(dt2, dt2alt) #TRUE