6

Suppose I have two data tables:

X <- data.table(id = 1:5, L = letters[1:5])

   id L
1:  1 a
2:  2 b
3:  3 c
4:  4 d
5:  5 e

Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))

   id  L  N
1:  3 NA 10
2:  4  g NA
3:  5  h 12

Would it be possible to do a left outer join of X and Y by id using data table built-in functions? If not, I would like build a function (e.g. leftOuterJoin) with the following expected output:

leftOuterJoin(X, Y, on = "id")

   id  L  N
1:  1  a NA
2:  2  b NA
3:  3 NA 10
4:  4  g NA
5:  5  h 12

I have tried without success:

X[Y, on = "id"]

   id L i.L  N
1:  3 c  NA 10
2:  4 d   g NA
3:  5 e   h 12

I have also tried this, which is almost what I am looking for:

setkey(X, id)
setkey(Y, id)
merge(X, Y, all.x = TRUE)

   id L.x L.y  N
1:  1   a  NA NA
2:  2   b  NA NA
3:  3   c  NA 10
4:  4   d   g NA
5:  5   e   h 12
mat
  • 2,412
  • 5
  • 31
  • 69
  • @jogo Looks good but my problem is that I will have to reassign every variable that I want to update/add. That is what I would like to avoid given that my data tables have many columns. – mat Jun 08 '17 at 10:55

2 Answers2

18

It is an update join:

library(data.table)
X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
X[Y, on=.(id), c("L", "N"):=.(i.L, i.N)][]
#    id  L  N
# 1:  1  a NA
# 2:  2  b NA
# 3:  3 NA 10
# 4:  4  g NA
# 5:  5  h 12

gives you the desired result.
Here I found a solution for multiple columns:

library(data.table)
X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))

X[Y, on=.(id), names(Y)[-1]:=mget(paste0("i.", names(Y)[-1]))]

Another variant:

n <- names(Y)
X[Y, on=.(id), (n):=mget(paste0("i.", n))]
jogo
  • 12,469
  • 11
  • 37
  • 42
  • As I mentioned in my previous comment, this works. However I would like to avoid the reassignment given that my original data tables are very large and it would be cumbersome to rename every variable. If there's no other answer/update I'll accept yours. Thank you! – mat Jun 08 '17 at 11:04
  • That's a very nice improvement! I'll would just put it this way: `X[Y, on = "id", names(Y) %>% .[!. %in% "id"] := mget(paste0("i.", names(Y) %>% .[!. %in% "id"]))]` because "id" may not be the first column – mat Jun 08 '17 at 13:45
  • or simple do the update also for `id`: `n <- names(Y); X[Y, on=.(id), (n):=mget(paste0("i.", n))]` – jogo Jun 08 '17 at 13:49
2

I may have miss few things, please correct me if there is a better solution. I usually like to write function for such things.

Here one: the goal is to have all possibilities available. Join and update as well the join variable, use other variable names ...

> update.DT <- function(DATA1, DATA2, join.variable, overwrite.variable, overwrite.with.variable) {
+       
+       DATA1[DATA2, c(overwrite.variable) := mget(p0("i.", overwrite.with.variable)), on = join.variable][]
+       
+     }
> X <- X2 <- X3 <- data.table(id = 1:5, L = letters[1:5], PS = rep(59, 5))
> Y <- data.table(id = 3:5, id2 = 11:13, L = c("z", "g", "h"), PS = rep(61, 3))
> X
   id L PS
1:  1 a 59
2:  2 b 59
3:  3 c 59
4:  4 d 59
5:  5 e 59
> Y
   id id2 L PS
1:  3  11 z 61
2:  4  12 g 61
3:  5  13 h 61
> update.DT(DATA1 = X, DATA2 = Y, join.variable = "id", overwrite.variable = c("L"), overwrite.with.variable = c("L"))
   id L PS
1:  1 a 59
2:  2 b 59
3:  3 z 59
4:  4 g 59
5:  5 h 59
> update.DT(DATA1 = X2, DATA2 = Y, join.variable = "id", overwrite.variable = c("L", "PS"), overwrite.with.variable = c("L", "PS"))
   id L PS
1:  1 a 59
2:  2 b 59
3:  3 z 61
4:  4 g 61
5:  5 h 61
> update.DT(DATA1 = X2, DATA2 = Y, join.variable = "id", overwrite.variable = c("L", "PS", "id"), overwrite.with.variable = c("L", "PS", "id2"))
   id L PS
1:  1 a 59
2:  2 b 59
3: 11 z 61
4: 12 g 61
5: 13 h 61
Dorian Grv
  • 421
  • 5
  • 9