28

I'm trying to match 4 variables pairwise and add a column with the lookup value. In base, I would do merge(df1,df2, by.x=c("lsr","ppr"),by.y=c("li","pro")), where df1 has 9 cols and df2 (2 being lsr and pro) df2 has only 3, li, pro, and the "value" I'm interested in, alpha.

This works fine, but as I'm beginning to be a huge fan of data.table, I would like to do this in the data.table way - and because I have some millions of rows - so base merge is slow (I saw, that the by.x, and by.y feature is pending for data.table, but maybe there is a workaround). See some sample data below:

df2:
         alpha         li        pro
      1: 0.5000000 0.01666667 0.01666667
      2: 0.3295455 0.03333333 0.01666667
      3: 0.2435897 0.05000000 0.01666667
      4: 0.1917808 0.06666667 0.01666667
      5: 0.1571429 0.08333333 0.01666667
df1:     
          demand rtime    mcv         mck        ppr       mlv         mlk        lsr
      1:    0.3     1 357.57700 0.099326944 0.01666667 558.27267 0.155075741 0.01666667
      2:    0.3    10 548.75433 0.152431759 0.01666667 614.30667 0.170640741 0.03333333
      3:    0.3    11 314.55767 0.087377130 0.01666667 636.48100 0.176800278 0.03333333
      4:    0.3     2 312.15033 0.086708426 0.01666667 677.48100 0.188189167 0.06666667
      5:    0.3     3 454.47867 0.126244074 0.01666667 608.92067 0.169144630 0.01666667
     ---                                                                               
6899196:    0.6     5 537.92673 0.149424093 1.00000000 537.92673 0.149424093 1.00000000
6899197:    0.6     6 277.34732 0.077040923 1.00000000 277.34732 0.077040923 1.00000000
6899198:    0.6     7  73.31484 0.020365235 1.00000000  73.31484 0.020365235 1.00000000
6899199:    0.6     8  32.04197 0.008900546 1.00000000  32.04197 0.008900546 1.00000000
6899200:    0.6     9  14.59008 0.004052799 1.00000000  14.59008 0.004052799 1.00000000

Last, maybe of interest is, that in df2 I have unique rows, and in df1, I have lots of duplicates in respect to lsr and ppr. I also tried to set two keys and join them the data.table way, and adding a new column with alpha. But without success.

M--
  • 25,431
  • 8
  • 61
  • 93
Puki Luki
  • 573
  • 1
  • 4
  • 13
  • If you don't need an outer join, you can just do a left, a right join or an inner join using binary join. The question is which one you want. Though judging by your base R code it seems like you wan an inner join? Also, which column you want to keep? You also don't have `pro` column in `df1` – David Arenburg May 21 '15 at 10:34
  • Thanks. Actually I believe, in this case the inner and the left joins are the same, as all combinations of lsr and ppr, or of li and pro are included in both df1 and df2. And not having pro in df1 is a typo!! Sorry for that, I meant, ppr in df1, I'll edit and correct it. I'd like to keep all columns from df1 and add alpha as a new column – Puki Luki May 21 '15 at 11:27
  • If all you need is a left join to `df1` while updating `alpha` column, it is a simple binary join. Try `setkey(df1, lsr, ppr) ; setkey(df2, li, pro) ; df1[df2, alpha := i.alpha] ; df1` – David Arenburg May 21 '15 at 11:44
  • Ha! thats it! Thanks a lot. I knew it was easy. Actually I tried that version, but I didn't know about the i. Just one follow-up, what's the need for the i. before alpha? – Puki Luki May 21 '15 at 11:58
  • 1
    So if you already have an `alpha` column in your `df1` set, `i` will tell `data.table` to get the information from the `alpha` in `df2`. It is always safer to put `i,` before the column names on the RHS in order to avoid possible unexpected results. – David Arenburg May 21 '15 at 12:00
  • 5
    OK. now I'm really a fan of data.table. Thanks! – Puki Luki May 21 '15 at 12:07
  • You have my upvote as fellow "great fan of data.table". That package is music to my fingers, poetry to my scripts! – Stefano Dec 16 '17 at 18:17

1 Answers1

28

You can use the statement provided by David Arenburg in comment:

setkey(df1, lsr, ppr)
setkey(df2, li, pro)
df1[df2, alpha := i.alpha]

From the current devel version, 1.9.5, we can perform joins directly without having to set keys using the on argument:

df1[df2, alpha := i.alpha, on = c(lsr="li", ppr="pro")]

If you don't want to install the devel version, then you can wait until this is pushed as v1.9.6 on CRAN.

Arun
  • 116,683
  • 26
  • 284
  • 387
jangorecki
  • 16,384
  • 4
  • 79
  • 160