0

I'm trying to fill up a large dataset with another dataset when ID's overlap. I have one data.table:

library(data.table)
dt1 <- data.table(id = c("a","b","c"),
              x = 1:3,
              y = c(1,NA,NA),
              z = c(1, NA, NA))

That has NA's in some variables. I have another data.table that has information about these variables:

dt2 <- data.table(id = c("b","c"),
              y = c(4,5),
              z = c(6,7))

and I want to combine the two by filling up dt1 with dt2. I'm not looking for workarounds (one is in p.s.) because I assume there is some clever join or merge way to do this. I just can't find it. In reality, both x and y+z are many columns so conditions based on one column don't help (though I can get the columns in a character vector or so).

goal <- data.table(id = c("a","b","c"),
              x = 1:3,
              y = c(1,4,5),
              z = c(1, 6, 7))

p.s.: Here is an ugly workaround but I think this kind of approach is error prone deep inside my function where columns change all the time etc:

adder <- merge(dt2, dt1[id %in% dt2[, id],id,x], by ="id", all = T)
done <- dt1[!is.na(y) & !is.na(z)]
rbind(adder, done)
Jakob
  • 1,325
  • 15
  • 31

1 Answers1

1

Would an update join work for your case?

library(data.table)

dt1 <- data.table(id = c("a","b","c"),
                  x = 1:3,
                  y = c(1,NA,NA),
                  z = c(1, NA, NA))


dt2 <- data.table(id = c("b","c"),
                  y = c(4,5),
                  z = c(6,7))


dt1[dt2, on=.(id), c("y", "z") := .(i.y, i.z)][]
#>    id x y z
#> 1:  a 1 1 1
#> 2:  b 2 4 6
#> 3:  c 3 5 7

Created on 2021-08-12 by the reprex package (v2.0.1)

TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
  • I knew it! Join! So you're creating a join on id and then renaming the i.... variables into the already existing variables? Great that this works, I didn't know about "update join" With the keyword, there is a duplicate: https://stackoverflow.com/questions/44433451/r-data-table-update-join – Jakob Aug 12 '21 at 09:34
  • 1
    @Jakob: The variables are not renamed. The `i` can be used to specify that the variables should come from the `data.table` in `i`, but this will also work without specifying `i ` as in `dt1[dt2, on=.(id), c("y", "z") := .(y, z)][]` – TimTeaFan Aug 12 '21 at 09:53