data.table mapping based on another data.table

Question

I have two data(.xlsx), DT1 and DT2. I want to create a new column newcol in DT1 based on original column in DT1, mapping with columns in DT2.
I know this is ambiguous so I explain more here:
First, here is my two data.

DT1
code    type  
AH1     AM
AS5     AM
NMR     AM
TOS     AM
IP      AD
CC      ADCE
CA      Wa

DT2
code    year   month  
AH1     2011   2
AH1     2011   5
AS5     2012   7
AS5     2012   6
AS5     2013   3
CC      2014   6
CA      2016   11

Second, in DT2, column year and month are unimportant in this question. We don't need to cosider it.
Third, the result I want is:

DT2
code    year   month   newcol
AH1     2011   2       AM
AH1     2011   5       AM
AS5     2012   7       AM
AS5     2012   6       AM
AS5     2013   3       AM
CC      2014   6       ADCE
CA      2016   11      Wa

newcol in DT2 is created based on data DT1.
I saw a syntax like DT2[DT1, ...] to solve but I forget it. Any help?

Data

DT1 <- "  code    type  
1:        AH1     AM
2:        AS5     AM
3:        NMR     AM
4:        TOS     AM
5:        IP      AD
6:        CC      ADCE
7:        CA      Wa
"
DT1 <- read.table(text=DT1, header = T)
DT1 <- as.data.table(DT1)

DT2 <- "code    year   month  
1: AH1     2011   2
2: AH1     2011   5
3: AS5     2012   7
4: AS5     2012   6
5: AS5     2013   3
6: CC      2014   6
7: CA      2016   11
"
DT2 <- read.table(text=DT2, header =T)
DT2 <- as.data.table(DT2)

P.S. Moreover, in excel, there is a function VLOOKUP to solve it:

# Take first obs. as an example. 
DT2
code    year   month  
AH1     2011   2
# newcol is column D. So in D2, we type:
=VLOOKUP(TRIM(A1), 'DT1'!$A$2:$A$8, 2, FALSE)

UPDATE based on comment under @akrun's answer.
My original DT1 has 86 obs. and DT2 has 451125 obs. I use the @akrun's answer and DT2 reduces to 192409. So weird. DT2$code doesn't contain any NA. I don't know why.

length(unique(DT1$code1)) 
[1] 86
length(unique(DT2$code))
[1] 39

table(DT1$code1) 
AHI AHI002 AHI004 AHI005 AHS002 AHS003 AHS004 AHS005    AMR AMR002 AMR003 AMRHI3   CARD   CCRU  HPA01  HWPA1 HWPA1T    IOA  IOA01 
 1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1 
IOA01T IPA010 IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5 IPACC6   IPAR  IPAR2 IPARK2 IPARKI   NAHI  NAHI2   NAMR  NAMR2    NCC 
 1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1 
NCC2   NCC5  NCC5T  NNAHI NNAHI2  NNAMR NNAMR2     PL    PL2   PLFI    REI    SPA SPA001   SPA3   TADS  TADS2   TAHI  TAHI2   TAHS 
 1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1 
TAHS2   TAMB  TAMB2   TAMD  TAMD2   TAMR  TAMR2  TBURN TBURN2   TCCR   TFPS    TFS   TFS2    THE  THIBN THIBN2   TICU  TICU2   TIPA 
 1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1 
TIPA2  TIPAK TIPAK2   TNCC    TOS   TOS2   TSAO  TSAO2   TSPA    WED 
 1      1      1      1      1      1      1      1      1      1 

table(DT2$code)
AHI002 AHI005 AHS002 AHS005 AMR    AMR003 Card   HPA01  HWPA1  HWPA1T IOA01  IOA01T IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5 
19408  12215  34184  12226  19408  12215  19408   7344   9198    405   9198    405  12215   5137   1148   2853  31703   9198   7878 
IPACC6 IPAR   IPAR2  IPARK2 IPARKI NAHI   NAHI2  NAMR   NAMR2  NCC2   NCC5   NCC5T  NNAHI  NNAHI2 NNAMR  NNAMR2 PL     PL2    SPA    
9668  41909   9643   2362   2967  10018   3589  10018   3589   7878   2845    536  14776   8104  14754   8118  18624   8302  40856 
SPA3   
6823

try setting the code as keys of the two data.table, then index the `DT2` with the `DT1` e.g., `DT2[DT1]`. — din, Jun 12 '17 at 03:21
Re forgetting it, you can review the examples at the bottom of `?data.table`. — Frank, Jun 12 '17 at 04:58

score 6 · Answer 1 · edited Jun 12 '17 at 04:19

6

We can do this with join from data.table

library(data.table)
DT2[DT1, on = .(code), nomatch = 0]
#   code year month type
#1:  AH1 2011     2   AM
#2:  AH1 2011     5   AM
#3:  AS5 2012     7   AM
#4:  AS5 2012     6   AM
#5:  AS5 2013     3   AM
#6:   CC 2014     6 ADCE
#7:   CA 2016    11   Wa

edited Jun 12 '17 at 04:19

M--

25,431
8
61
93

answered Jun 12 '17 at 04:08

akrun

874,273
37
540
662

1

Yup, this is the correct way in data.table. – Frank Jun 12 '17 at 04:57
1

great. However, if `DT1` and `DT2` have different column name, how can I do? ex. `DT2` is `code`, but `DT1` is `code1` – Peter Chen Jun 12 '17 at 06:32
5

@PeterChen IN that case `DT2[DT1, on = .(code = code1), nomatch = 0]` – akrun Jun 12 '17 at 06:33
2

another small question: what does `nomatch = 0` mean? tks – Peter Chen Jun 12 '17 at 06:36
3

@PeterChen Otherwise, it will get all the rows that are not matching – akrun Jun 12 '17 at 06:40
3

If you want to learn something in r, learn it from @akrun. – M-- Jun 12 '17 at 17:43
my original `DT2` has 451125 obs. and `DT1` has 86 obs. I use the code and `DT2` reduces to 192409. So weird. `DT2$code` doesn't contain any `NA`. I don't know why – Peter Chen Jun 13 '17 at 04:01
@PeterChen Do you duplicates for 'code' in both datasets? – akrun Jun 13 '17 at 04:02
duplicates for 'code'? can u explain more? i don'y understand. thanks – Peter Chen Jun 13 '17 at 04:06
1

@PeterChen I meant if the 'code' column have frequency >1 for some observations in both datasets, it could happen. It is better to check on a subset of datasets from both and see if you get expected – akrun Jun 13 '17 at 04:08
`length(unique(DT1$code1))` = 86 and `length(unique(DT2$code))` = 39 – Peter Chen Jun 13 '17 at 04:09
1

@PeterChen check for `table(DT1$code1)` and `table(DT2$code)` – akrun Jun 13 '17 at 04:12
I update to my question – Peter Chen Jun 13 '17 at 04:25
@PeterChen Looks fine to me as you have unique elements in one of the datasets which frequency 1. are you getting the expected with the other solution – akrun Jun 13 '17 at 04:28
@PeterChen Can you check if there is leading/lagging spaces for 'code' in each of the dataset – akrun Jun 13 '17 at 06:23
1

I just found the problem. `DT1[DT2, on = .(code1 = code)]`. This one is right after remove white space. Thank you. – Peter Chen Jun 13 '17 at 06:31

M-- · Accepted Answer · 2017-06-12T03:49:23.290

3

You can use merge in base R:

DT2 <- (merge(DT1, DT2, by = 'code'))

Note: It'd also sort it by 'code' column.

You can also use plyr package:

DT2 <- plyr::join(DT2, DT1, by = "code")

As you are interested in using data.table package:

library(data.table)
DT2 <- data.table(DT2, key='code')
DT1 <- data.table(DT1, key='code')

DT2[DT1]

Or qdap package:

DT2$type <- qdap::lookup(DT2$code, DT1)

edited Jun 12 '17 at 03:49

answered Jun 12 '17 at 03:23

M--

25,431
8
61
93

your answer is work and awesome. However, my data is so large that `data.table` must faster than `plyr` or `dplyr`. – Peter Chen Jun 12 '17 at 03:40
@PeterChen Updated with two more options including `data.table`. – M-- Jun 12 '17 at 03:54
is `qdap` faster than `data.table`? – Peter Chen Jun 12 '17 at 04:05
1

No. `data.table` and `merge` are the fastest ones. I cannot check it right now but you can use `microbenchmark::microbenchmark(function1, function2, times=1000)` to check time of execution. – M-- Jun 12 '17 at 04:19

data.table mapping based on another data.table

Data

2 Answers2