match rows of two data.tables to fill subset of a data.table

Question

I've got two datatables:

> DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c")
                  , col2 = c("b","d","c","a","d","a","c","a")
                  , col3 = c(1,2,3,4,5,6,7,8))
> DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a")
                  , col2 = c("d","b","c","d","a","a","c","a")
                  , col3 = c(NA,1,2,NA,6,NA,3,NA))

> DT1
   col1 col2 col3
1:    a    b    1
2:    b    d    2
3:    b    c    3
4:    a    a    4
5:    c    d    5
6:    b    a    6
7:    a    c    7
8:    c    a    8

> DT2
   col1 col2 col3
1:    b    d   NA
2:    e    b    1
3:    c    c    2
4:    e    d   NA
5:    b    a    6
6:    c    a   NA
7:    d    c    3
8:    a    a   NA

I'd like to match the rows of DT2 for which col3 is NA to those of DT1, using col1 and col2 and fill the NA values of col3 in DT2 with those in DT1, if match exists.

> #desired Output
> DT2_output
   col1 col2 col3
1:    b    d    2
2:    e    b    1
3:    c    c    2
4:    e    d   NA
5:    b    a    6
6:    c    a    8
7:    d    c    3
8:    a    a    4

how do I do this using succinct data.table operation (no loops), as there are millions of rows in each data.table. I tried the following and it gives me error, which I believe has something to do with the which statements.

> ##doesn't work
> DT2[is.na(col3), col3 := DT1[which(col1 == DT2[is.na(col3),col1] && col2 == DT2[is.na(col3), col2]), col3]]

maybe `DT2[is.na(col3), col3 := DT1[DT2[is.na(col3)], col3, on=.(col1, col2)]]` — chinsoon12, Jul 13 '18 at 09:09

Patrik_P · Accepted Answer · 2018-07-13T22:17:10.090

I could do directly a left join and then ifelse condition based on if the original col3 is missing like follows:

DT2new <- merge(DT2, DT1, by = c("col1", "col2"), all.x = T)
DT2new[, col3 := ifelse(is.na(col3.x), col3.y, col3.x)]
DT2new <- DT2new[, .(col1, col2, col3)]

#   col1 col2 col3
#1:    a    a    4
#2:    b    a    6
#3:    b    d    2
#4:    c    a    8
#5:    c    c    2
#6:    d    c    3
#7:    e    b    1
#8:    e    d   NA

Alternatively, a more efficient way is to do the operation by reference, which directly modifies (in place by reference) the DT2 data.table :

DT2[DT1, on = .(col1, col2), col3 := i.col3]

#   col1 col2 col3
#1:    b    d    2
#2:    e    b    1
#3:    c    c    2
#4:    e    d   NA
#5:    b    a    6
#6:    c    a    8
#7:    d    c    3
#8:    a    a    4

The i in i.col3 refers to the i as in DT[i, j, by], hence to the column col3 in DT1. This works because you do not have any cases when a non missing match for columns col1 and col2 in both data.tables would occur. If you have such cases you could do a more generic operation as follows (including examples of data.tables DT1 and DT2):

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c", "e"), 
                  col2 = c("b","d","c","a","d","a","c","a", "b"),
                  col3 = c(1,2,3,4,5,6,7,8, 22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

You can see that DT1 has for col1 = "e" and col2 = "b" a value of 22. The DT2 has for col1 = "e" and col2 = "b" a value of 1. In order to give preference to DT2 if such clash occurs you would do:

DT2[DT1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]

Which gives you

#   col1 col2 col3
#1:    b    d    2
#2:    e    b    1
#3:    c    c    2
#4:    e    d   NA
#5:    b    a    6
#6:    c    a    8
#7:    d    c    3
#8:    a    a    4

The x. in x.col3 refers to col3 in DT2.

EDIT: A more efficient generic method is the vectorised data.table operation approach (in case that DT1 contains values for a certain pair of col1 and col2 on its own)

Given the usefull comments (@Frank & @chinsoon12), I double checked on the provided solution. As pointed out, the ifelse can get slow (reasons in the comment) and that is why the vectorised solution is a better way to go:

dt1 <- data.table(col1 = paste0(rep(letters[1:26], times = 10000), rep(c(1:10000), each = 26)),
                  col2 = paste0(rep(letters[26:1], times = 10000), rep(c(1:10000), each = 26)),
                  col3 = rep(c(3,1,2,4,6,5,3,6,2,NA), times = 26000))

dt2 <- data.table(col1 = paste0(rep(letters[1:26], times = 10000), rep(c(1:10000), each = 26)),
                  col2 = paste0(rep(letters[26:1], times = 10000), rep(c(1:10000), each = 26)),
                  col3 = rep(c(10,NA,22,51,60,NA,32,NA,28,NA), times = 26000))
dt2alt <- copy(dt2)

microbenchmark::microbenchmark(
NoIfelse = dt2[is.na(col3), col3 := dt1[dt2[is.na(col3)], on = .(col1, col2), col3]],
Ifelse = dt2alt[dt1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]
)

#Unit: milliseconds
#     expr       min       lq     mean   median       uq      max neval cld
# NoIfelse  94.06635 100.2841 107.4140 103.7807 107.4006 154.7868   100  a 
#   Ifelse 219.94923 243.2651 254.3271 250.6967 266.8300 306.7213   100   b

identical(dt2, dt2alt) #TRUE

Re the more generic case, there should be no need for ifelse. You can do `DT2[is.na(col3), col3 := DT1[DT2, on=.(col1, col2), x.col3]]` or similar (I didn't test it). Ifelse has poor performance and sometimes unpredictable output. — Frank, Jul 13 '18 at 14:41
Thanks @Frank I shall try it out will come back with a benchmark — Patrik_P, Jul 13 '18 at 14:44
Ok cool. Fwiw, some related benchmarks: https://stackoverflow.com/questions/30980545/alternative-of-slower-ifelse-in-r-data-table and https://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow — Frank, Jul 13 '18 at 14:46
Turns out to be a useful piece of information, thanks again! — Patrik_P, Jul 13 '18 at 22:12
@Patrik_P I've got a different situation with my data where multiple conditions need to apply for merging and this solution doesn't seem to work. https://stackoverflow.com/questions/54880869/conditional-data-table-match-for-subset-of-data-table I'm wondering what I am doing wrong here. thanks. — Ankhnesmerira, Feb 26 '19 at 08:17

match rows of two data.tables to fill subset of a data.table

1 Answers1

Linked