merge-like scenario with two data.tables

Question

I have two dataframes (actually data.tables).

set.seed(123)
dt1 <- data.table(P=rep(letters[1:3],c(4,2,3)),X=sample(9))
dt1
   P X
1: a 3
2: a 7
3: a 9
4: a 6
5: b 5
6: b 1
7: c 2
8: c 8
9: c 4

and:

dt2 <- data.table(P=rep(letters[1:5],length=10),D=c("X","Y","Z","G","F"))
dt2
    P D
 1: a X
 2: b Y
 3: c Z
 4: d G
 5: e F
 6: a X
 7: b Y
 8: c Z
 9: d G
10: e F

Now I want to add a new column to dt1, with column "D" of dt2 where P has the same value in dt1 and dt2. It should look like this:

dt_new
   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z

Please fix your syntax so your code can run. – Frank Aug 20 '13 at 22:26 — Frank, Aug 20 '13 at 22:26

score 7 · Accepted Answer · answered Aug 20 '13 at 22:31

7

I'd do a data.table join in this manner:

setkey(dt1, P)
dt1[unique(dt2),nomatch=0]

   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z

answered Aug 20 '13 at 22:31

Arun

116,683
26
284
387

score 6 · Answer 2 · answered Aug 21 '13 at 04:06

+1 to Arun's answer. To show the update-by-reference way to do this ...

Example data from question again:

set.seed(123)
dt1 = data.table(P=rep(letters[1:3],c(4,2,3)),X=sample(9))
dt2 = data.table(P=rep(letters[1:5],length=10),D=c("X","Y","Z","G","F"))

Removed dups in example data using unique() as Arun did :

dt2 = unique(dt2)  
dt2
   P D
1: a X
2: b Y
3: c Z
4: d G
5: e F

Now add D by reference to dt1 with data from dt2. Like a foreign key in SQL. Admittedly this syntax isn't obvious or particularly elegant but it does avoid the copy of dt1. So it can be significantly faster if dt1 is say 10GB in size.

setkey(dt2, P)
dt1[,D:={ .P=P           # allows us to refer to the P from dt1 on next line
          dt2[.P,D]$D}]  # since P is type character, no need to J() or .()
dt1
   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z

Or, keeping the duplicates in dt2:

set.seed(123)
dt1 = data.table(P=rep(letters[1:3],c(4,2,3)),X=sample(9))
dt2 = data.table(P=rep(letters[1:5],length=10),D=c("X","Y","Z","G","F"))
setkey(dt2,P)
dt2
    P D
 1: a X
 2: a X
 3: b Y
 4: b Y
 5: c Z
 6: c Z
 7: d G
 8: d G
 9: e F
10: e F
dt1[,D:={ .P=P
          dt2[.P,D,mult="first"]}]
dt1
   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z

I guess you mean the extra columns of `dt1` besides the join column are not copied this way? Did I understand you correctly? — eddi, Aug 21 '13 at 05:37
@eddi Yes, none of `dt1`'s columns are copied, join columns or otherwise. Arun's answer returns a new data.table. The other natural by-reference way is `dt1[dt2,D:=i.D]` but I seem to remember we can't use `i.` notation in the right hand side of `:=` yet. That way needs `dt1` (the typically large table) to be keyed rather than `dt2` (the typically small lookup table). But on the other hand once `dt1` is keyed there would be many fewer rows to join from `dt2` so that would be faster. — Matt Dowle, Aug 21 '13 at 08:37
@eddi No, no copy. `.P=P` is just a new binding of the symbol `.P` to the same object that `P` is bound to, as normal in R. The first part of [this answer](http://stackoverflow.com/a/14293056/403310) might help. No data.table here, just the same as base R. — Matt Dowle, Aug 21 '13 at 15:23

merge-like scenario with two data.tables

2 Answers2

Linked