3

I have two dataframes (actually data.tables).

set.seed(123)
dt1 <- data.table(P=rep(letters[1:3],c(4,2,3)),X=sample(9))
dt1
   P X
1: a 3
2: a 7
3: a 9
4: a 6
5: b 5
6: b 1
7: c 2
8: c 8
9: c 4

and:

dt2 <- data.table(P=rep(letters[1:5],length=10),D=c("X","Y","Z","G","F"))
dt2
    P D
 1: a X
 2: b Y
 3: c Z
 4: d G
 5: e F
 6: a X
 7: b Y
 8: c Z
 9: d G
10: e F

Now I want to add a new column to dt1, with column "D" of dt2 where P has the same value in dt1 and dt2. It should look like this:

dt_new
   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z
beginneR
  • 3,207
  • 5
  • 30
  • 52

2 Answers2

7

I'd do a data.table join in this manner:

setkey(dt1, P)
dt1[unique(dt2),nomatch=0]

   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z
Arun
  • 116,683
  • 26
  • 284
  • 387
6

+1 to Arun's answer. To show the update-by-reference way to do this ...

Example data from question again:

set.seed(123)
dt1 = data.table(P=rep(letters[1:3],c(4,2,3)),X=sample(9))
dt2 = data.table(P=rep(letters[1:5],length=10),D=c("X","Y","Z","G","F"))

Removed dups in example data using unique() as Arun did :

dt2 = unique(dt2)  
dt2
   P D
1: a X
2: b Y
3: c Z
4: d G
5: e F

Now add D by reference to dt1 with data from dt2. Like a foreign key in SQL. Admittedly this syntax isn't obvious or particularly elegant but it does avoid the copy of dt1. So it can be significantly faster if dt1 is say 10GB in size.

setkey(dt2, P)
dt1[,D:={ .P=P           # allows us to refer to the P from dt1 on next line
          dt2[.P,D]$D}]  # since P is type character, no need to J() or .()
dt1
   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z

Or, keeping the duplicates in dt2:

set.seed(123)
dt1 = data.table(P=rep(letters[1:3],c(4,2,3)),X=sample(9))
dt2 = data.table(P=rep(letters[1:5],length=10),D=c("X","Y","Z","G","F"))
setkey(dt2,P)
dt2
    P D
 1: a X
 2: a X
 3: b Y
 4: b Y
 5: c Z
 6: c Z
 7: d G
 8: d G
 9: e F
10: e F
dt1[,D:={ .P=P
          dt2[.P,D,mult="first"]}]
dt1
   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • I guess you mean the extra columns of `dt1` besides the join column are not copied this way? Did I understand you correctly? – eddi Aug 21 '13 at 05:37
  • @eddi Yes, none of `dt1`'s columns are copied, join columns or otherwise. Arun's answer returns a new data.table. The other natural by-reference way is `dt1[dt2,D:=i.D]` but I seem to remember we can't use `i.` notation in the right hand side of `:=` yet. That way needs `dt1` (the typically large table) to be keyed rather than `dt2` (the typically small lookup table). But on the other hand once `dt1` is keyed there would be many fewer rows to join from `dt2` so that would be faster. – Matt Dowle Aug 21 '13 at 08:37
  • None of the columns? Doesn't `.P=P` copy the join column? – eddi Aug 21 '13 at 12:40
  • 2
    @eddi No, no copy. `.P=P` is just a new binding of the symbol `.P` to the same object that `P` is bound to, as normal in R. The first part of [this answer](http://stackoverflow.com/a/14293056/403310) might help. No data.table here, just the same as base R. – Matt Dowle Aug 21 '13 at 15:23