3

I can conduct a matched copy of a column in a data frame into another by something like

DF2$y <- DF1[match(DF2$id2, DF1$id1), "z"]  # DF1 and DF2 are data frames

where DF2$id2 is matched to DF1$id1. I would like to know what I can do with data tables for this kind of operation. My data tables have millions of rows and hundreds of columns. I've done setkey(DT1, id1) and setkey(DT2, id2).

This works:

DT2[, y := DT1[match(DT2$id2, DT1$id1), "z"]]  # DT1 and DT2 are data tables

but I am afraid that the match part might take long than necessary. (Or is it inevitable?)

I understand that I can also use column selection, merge, and renaming:

tmp <- DT1[, c("id1", "z")]  # column selection
DT3 <- merge(DT2, tmp, by.x = "id2", by.y = "id1", all.x = TRUE, suffixes = c("", ".y")) # merge
setnames(DT3, "z.y", "y")  # rename

(where the first two lines can be written on one line) but this seems a bit too complicated. Would there be a simpler and fast solution?

Thanks.

Example:

library(data.table)
DF1 <- data.frame(id1=2:4, x=LETTERS[1:3], z=11:13)
DF2 <- data.frame(id2=1:4, x=LETTERS[5:8], z=21:24)
DF1
#   id1 x  z
# 1   2 A 11
# 2   3 B 12
# 3   4 C 13
DF2
#   id2 x  z
# 1   1 E 21
# 2   2 F 22
# 3   3 G 23
# 4   4 H 24

DT1 <- data.table(DF1)
DT2 <- data.table(DF2)
setkey(DT1, id1)
setkey(DT2, id2)

DF2$y <- DF1[match(DF2$id2, DF1$id1), "z"]
DF2  # correct
#   id2 x  z  y
# 1   1 E 21 NA
# 2   2 F 22 11
# 3   3 G 23 12
# 4   4 H 24 13

DT2[, y := DT1[match(DT2$id2, DT1$id1), "z"]]
DT2
#    id2 x  z  y
# 1:   1 E 21 NA
# 2:   2 F 22 11
# 3:   3 G 23 12
# 4:   4 H 24 13
DT2[, y := NULL]

tmp <- DT1[, c("id1", "z")]
DT3 <- merge(DT2, tmp, by.x = "id2", by.y = "id1", all.x = TRUE, suffixes = c("", ".y"))
setnames(DT3, "z.y", "y")
DT3
#    id2 x  z  y
# 1:   1 E 21 NA
# 2:   2 F 22 11
# 3:   3 G 23 12
# 4:   4 H 24 13

## Simpler alternatives?
chan1142
  • 609
  • 4
  • 13
  • @GregorThomas Thanks. But I want to "merge" only one variable, not the full data set. For this I need to do column selection, merging, and then renaming as I explained in the question. Are there simpler ways? – chan1142 Sep 28 '21 at 17:53
  • 2
    `merge` is the simple. It's annoying about this particular example that the names of your ID columns don't match, and the name of the column you want to add needs to change, but renaming is a one-liner that is very efficient. – Gregor Thomas Sep 28 '21 at 17:58
  • 1
    And I would agree with your worry - I would expect `data.table::merge` to be much faster than `base::match`, especially on keyed tables. But I disagree with your assessment that renaming a couple columns is too complicated. – Gregor Thomas Sep 28 '21 at 18:01
  • 2
    Or a little more simply, `merge(DT2, setnames(DT1[, c("id1", "z")], c("id2", "y")), all.x = TRUE)` – Gregor Thomas Sep 28 '21 at 18:06
  • 1
    @GregorThomas Sure. Thanks. I wondered if there are other solutions, and I am glad to learn I need not look for other solutions. I've been wandering around with this issue so long. – chan1142 Sep 28 '21 at 18:10

2 Answers2

5

If I understand correctly, the OP wants to append column z from DT1 to DT2 as column y where the id columns match.

With , this can be solved using an update join:

library(data.table)
DT2[DT1, on = .(id2 = id1), y := i.z]
DT2
   id2 x  z  y
1:   1 E 21 NA
2:   2 F 22 11
3:   3 G 23 12
4:   4 H 24 13

Note that DT2 is updated by reference, i.e., without copying the whole data object. This might be handy for OP's large production datasets of millions of rows.

Caveat

This works because id1 and id2 are unique which is the case for the sample use case. So, make sure that you get what you want when you do update joins on duplicate values.

Let's see what will happen if there are duplicate values in the id1 column, e.g.

In case DT1 has id1 == 4 duplicated

(DT1 <- data.table(id1 = c(2:4, 4), x = LETTERS[1:4], z = 11:14))
   id1 x  z
1:   2 A 11
2:   3 B 12
3:   4 C 13
4:   4 D 14

then

DT2[DT1, on = .(id2 = id1), y := i.z][]

returns

   id2 x  z  y
1:   1 E 21 NA
2:   2 F 22 11
3:   3 G 23 12
4:   4 H 24 14

So, the update join

  • has not created additional rows in DT2 (which is probably what you may want to avoid copying of large datasets),
  • has picked the last occurrence of z in case of multiple matches.
Uwe
  • 41,420
  • 11
  • 90
  • 134
0

I have two possible ideas:

  1. Using merge and select the columns and do the renaming there
DT3 <- merge(DT2, DT1[, .(id1, y = z)], by.x = "id2", by.y = "id1", all.x = TRUE)
  1. Using data.table's on syntax. I based this solution directly off of the answer here, which shows how to do a full outer join: https://stackoverflow.com/a/46904676/9244371
unique_keys <- unique(c(DT1[, id1], DT2[, id2]))
DT3 <- DT2[DT1[.(unique_keys), .(id1, y = z)]]
Hutch3232
  • 408
  • 4
  • 11