8

So here is my real life problem which I feel like can be easily solved and I'm missing something obvious here. I have two big data sets called TK and DFT

library(data.table)
set.seed(123)
(TK <- data.table(venue_id = rep(1:3, each = 2), 
                  DFT_id = rep(1:3, 2), 
                  New_id = sample(1e4, 6),
                  key = "DFT_id"))

#    venue_id DFT_id New_id
# 1:        1      1   2876
# 2:        1      2   7883
# 3:        2      3   4089
# 4:        2      1   8828
# 5:        3      2   9401
# 6:        3      3    456

(DFT <- data.table(venue_id = rep(1:2, each = 2), 
                   DFT_id = 1:4, 
                   New_id = sample(4),
                   key = "DFT_id"))

#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        1      2      4
# 3:        2      3      2
# 4:        2      4      1

I want to perform a binary left join to TK on the DFT_id column when venue_id %in% 1:2, while updating New_id by reference. In other words, the desired result would be

TK
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

I was thinking to combine both conditions, but it didn't work (still not sure why)

TK[venue_id %in% 1:2 & DFT, New_id := i.New_id][]
# Error in `[.data.table`(TK, DFT & venue_id %in% 1:2, `:=`(New_id, i.New_id)) : 
#   i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). 
# Please let datatable-help know if you'd like this, or add your comments to FR #1611.

My next idea was to use chaining which partially achieves the goal by joining correctly but on some temporary table without actually affecting TK

TK[venue_id %in% 1:2][DFT, New_id := i.New_id][]
TK
#    venue_id DFT_id New_id
# 1:        1      1   2876
# 2:        2      1   8828
# 3:        1      2   7883
# 4:        3      2   9401
# 5:        2      3   4089
# 6:        3      3    456

So to make clear, I'm well aware that I can split TK into two tables, perform the join and then rbind again, but I'm doing many different conditional joins like this and I'm also looking for both speed and memory efficient solutions.

This also means that I am not looking for a dplyr solution as I'm trying to use both binary join and the update by reference features which only exist in the data.table package IIRC.


For additional information see these vignettes:

David Arenburg
  • 91,361
  • 17
  • 137
  • 196

2 Answers2

8

Copying from Arun's updated answer here

TK[venue_id %in% 1:2, New_id := DFT[.SD, New_id]][]
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

His answer gives the details of what is going on.

Community
  • 1
  • 1
DaveTurek
  • 1,297
  • 7
  • 8
  • @Arun could you ellaborate on your comment on the other answer for how this can be used to update several columns? I can't seem to parse out how it can be done cleanly--using `.SDcols`, e.g. For now, I'm using `:=` and naming each explicitly. – MichaelChirico May 19 '15 at 22:09
5

Here's a very simple approach:

TK[DFT, New_id := fifelse(venue_id %in% 1:2, i.New_id, New_id)][]
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

I haven't checked, but I suspect the other answer is faster.

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
eddi
  • 49,088
  • 6
  • 104
  • 155
  • When you've to update more than one column as well, the other one is nicer.. I find. – Arun Apr 16 '15 at 08:16
  • 2
    @Arun I agree. Although tbh neither solution is particularly aesthetically pleasing to me, I almost want OP's original attempt to work, except that it makes no sense. – eddi Apr 16 '15 at 08:39
  • Smth like this would be more palatable for me if it worked: `TK[venue_id %in% 1:2, .SD[DFT, New_id := i.New_id]]` – eddi Apr 16 '15 at 08:42
  • it's not *that* bad ;-), but I agree it would be nice. although I'm not sure if it's possible (or how) with the current design.. – Arun Apr 16 '15 at 09:02
  • Why my approach makes "*no sense*"? It makes perfect sense to me. `i` expression is trying to evaluate `DFT` and when not found, performs a binary join on `DFT` in the global environment (pretty much what happens in a regular binary join I believe) and the same process happens with `venue_id`, though `venue_id` is found in the parent environment and is being evaluated there. Evaluating from different environments is something acceptable within the `data.table` environment, for example `vec <- 3 ; DT <- data.table(A = c("a", "a", "b", "b"), B = 1:4) ; DT[B %in% vec & A == "b"]` – David Arenburg Apr 16 '15 at 11:41
  • I get an error with this, unless I add `allow.cartesian=TRUE`. Perhaps because I'm using version 1.9.4? – DaveTurek Apr 16 '15 at 12:00
  • @DaveTurek, yes. It is a recent change allowing duplicated Id's. – David Arenburg Apr 16 '15 at 14:04
  • 1
    @DavidArenburg I really want it to make sense, but if that worked it would completely violate regular R syntax - how is one to know that your "&" there is a new magical kind of an *and* operator, instead of the regular one, that would literally apply the operation between the vector and the `data.table`? You could though introduce a new operator, call it let's say `%and%`, and then it would make perfect sense. – eddi Apr 17 '15 at 01:20
  • I can confirm (anecdotally, no benchmarking) the other answer wins on speed, _especially_ if you've to update more than one column conditionally – MichaelChirico May 27 '15 at 20:10