How to merge two large datasets while generate new column with different repeat value in r

Question

I have a question which drives me crazy and really need your help. The simplified question is this:

d1<-data.table(v1=c("a","b","c","d","d","b","a","c","a","d","b","a"),
                    v2=(seq(1:12)),V3=rep(1:4,times=3))

d2<-data.table(v1=c("a","b","c","d"),v3=c(3,2,1,4),v4=c("y","x","t","e"))

This will yield two data sets:

    D1:     
    v1 v2 V3
 1:  a  1  1
 2:  b  2  2
 3:  c  3  3
 4:  d  4  4
 5:  d  5  1
 6:  b  6  2
 7:  a  7  3
 8:  c  8  4
 9:  a  9  1
10:  d 10  2
11:  b 11  3
12:  a 12  4

> d2
   v1 v3 v4
1:  a  3  y
2:  b  2  x
3:  c  1  t
4:  d  4  e

As you could see that the elements in v1 and v3 is the same. Now I want to joint both data set by creating a new column in the D1 which return the value of V4 in d2 that match both indices v1 and v3, I hope I could get output looking like this:

>

 d3
    v1 v2 V3 V4
 1:  a  1  1 na
 2:  b  2  2  x
 3:  c  3  3 na
 4:  d  4  4  e
 5:  d  5  1 na
 6:  b  6  2  x
 7:  a  7  3  y
 8:  c  8  4 na
 9:  a  9  1 na
10:  d 10  2 na
11:  b 11  3 na
12:  a 12  4 na

The size of actual data I am using is relatively very large. It is something like joint 113MB data with 23MB. I tried to use for loop to do this problem by because the data is so long, it takes ages to finish the task. I also tried mergeand sqldf but both of them failed to finish the job. Could you please help me with this problem? Thank you very much!

The reason sqldf took so long is likely that you did not put indexes on the join columns. There are examples on the home page and there have been lots of questions on SO with examples as well. — G. Grothendieck, Nov 03 '14 at 22:28

score 8 · Answer 1 · answered Nov 03 '14 at 18:18

I'd do it like this:

setkey(d1, v1, V3) 
d1[d2, v4 := v4][]

For a join of the form x[i], key for x needs to be set. i may or may not have the key set. So we set the key for d1 here to columns v1 and V3.
Next, we perform a join d1[d2] which, for each row of d2 finds the rows that matches with the key columns of d1 and returns the join result. We're not looking for that result exactly. We'd rather like to add a new column where each matching row gets it's value from d2's v4 and otherwise NA. For this we make use of data.table's sub-assign by reference functionality. While joining i to x, we can still provide an expression in j, and refer to i's columns. You can also refer to them as i.v4 (usually used if there are columns of the same names in both x and i).
:= adds/updates a column by reference. The LHS of := is the column name we want to create here and the RHS v4 is the value we want to assign it from (here, it's the column from d2). For each matching row therefore, we assign d2's v4 onto d1's new column (which we name) v4 by reference (in-place, meaning no copy is made), and those rows with no matches will get the default value of NA.
The last [] is just to print the output to screen, as := returns the result invisibly.

Hope this helps to understand what's going on here.

Thank you very much for your detailed explanation, it works very well. Thanks again!!! — sxgn, Nov 04 '14 at 10:03

How to merge two large datasets while generate new column with different repeat value in r

1 Answers1

Linked