13

Let's say I have two data tables (dt1 and dt2), and I want to get dt3 using data tables. A,B,C,E,F,G,H are column names. dt1 key is column A, and dt2 key is column E. Data tables have different number of rows. I want to keep all the columns from DT1, and add only one column (H) from DT2 to the joined data table. Eventually, I will store this as DT1 (though I showed it as dt3 below).

How can I achieve it with data tables? I have an ugly solution with merge + data frames.

dt1 
A   B   C   
1   4   7   
2   5   8   
3   6   9   
2   20  21

dt2
E   F   G   H
1   10  13  16
3   12  15  18    
2   11  14  17


dt3
A   B   C   H
1   4   7   16
2   5   8   17
3   6   9   18
2   20  21  17          
zx8754
  • 52,746
  • 12
  • 114
  • 209
user2649059
  • 143
  • 1
  • 6
  • 2
    You want to do a left join to `dt1` and add a new column to it or you just want an outer join and create a new data set? If the first option is what you looking for, try `setkey(setDT(dt1), A) ; dt1[dt2, H := i.H]` – David Arenburg Jun 18 '15 at 11:00
  • I tried to clarify my question. I want to keep all rows and columns from dt1. I just want to add a column from dt2. thanks – user2649059 Jun 18 '15 at 11:22
  • So why did you create `df3` then? Also, didn't my solution work for you? It should do exactly what you need. – David Arenburg Jun 18 '15 at 11:23
  • Just ignore dt3 (was trying to make my example cleaner). Your solution works!! Thanks a lot... Just one question, how come we did not need to setkey for dt2? Is it because we use first column of dt2 to join by default? – user2649059 Jun 18 '15 at 11:34
  • Yes, exactly. Though you can key `df2` by `E` for safety if you like. – David Arenburg Jun 18 '15 at 11:39

2 Answers2

24

In order to perform a left join to df1 and add H column from df2, you can combine binary join with the update by reference operator (:=)

setkey(setDT(dt1), A) 
dt1[dt2, H := i.H]

See here and here for detailed explanation on how it works


With the devel version (v >= 1.9.5) we could make it even shorter by specifying the key within setDT (as pointed by @Arun)

setDT(dt1, key = "A")[dt2, H := i.H]

Edit 24/7/2015

You can now run a binary join using the new on parameter without setting keys

setDT(dt1)[dt2, H := i.H, on = c(A = "E")]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Could you please help me understand the use of "i.H"? – Chris Jun 03 '19 at 05:51
  • @Chris `i.` just tells data.table where to take the column from when you are doing a join. If we have `X[Y]` then the `Y` table is in `X`s `i` location. This is useful when both tables have columns with same names. – David Arenburg Jun 03 '19 at 06:57
  • @DavidArenburg those links are broken. Is there anyway to only keep columns from dt1? – Herman Toothrot Aug 04 '20 at 16:44
  • @HermanToothrot not sure what you mean. This solution already keeps only the columns from `dt1`. – David Arenburg Aug 04 '20 at 16:47
  • @DavidArenburg sorry I wasn't clear, I want to keep only columns from dt1 and no columns from dt2. – Herman Toothrot Aug 05 '20 at 11:41
  • @HermanToothrot I understood you. When you run `setDT(dt1)[dt2, H := i.H, on = c(A = "E")]` you are only keeping the columns from `dt1` in `dt1`, while updating the `H` column from the `H` in `dt2`. Did you try the code? Perhaps post a new question as I don't understand your issue. – David Arenburg Aug 05 '20 at 12:06
4

data.table solution

setDT(dt1)[ , H := dt2$H[match(dt1$A , dt2$E)] , ]

#    A  B  C  H
# 1: 1  4  7 16
# 2: 2  5  8 17
# 3: 3  6  9 18
# 4: 2 20 21 17

another dplyr solution will be

left_join(x = dt1 , y = dt2 , by = c("A" = "E")) %>% 
select(one_of(c("A" , "B" , "C" , "H")))
Nader Hisham
  • 5,214
  • 4
  • 19
  • 35