I wrote some code that performs many joins in a loop on tables with about 1M rows. At first I was creating new tables with a left join, but when I changed the code to use an update join instead, it became slower.
Some test data:

library(data.table)
library(microbenchmark)

a = data.table(id = 1:10^6)
b = data.table(id = 1:10^6, t = 'y')
cols = setdiff(names(b), names(a))
c = copy(b[a, on = .(id)])              # left join materialised as a new table
a[b, (cols) := mget(cols), on = .(id)]  # update join, modifies a by reference
all.equal(a, c)
[1] TRUE
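To check that the update join really does modify a in place rather than replacing it, one option is to compare the table's address before and after with data.table's address() helper. A minimal sketch on a rebuilt copy of the test data (data.table over-allocates column pointer slots, so adding a column by reference should not move the table):

library(data.table)

a = data.table(id = 1:10^6)      # rebuild the test tables
b = data.table(id = 1:10^6, t = 'y')
cols = setdiff(names(b), names(a))

addr = address(a)                # address of a before the update join
a[b, (cols) := mget(cols), on = .(id)]
identical(addr, address(a))      # should be TRUE: a was modified by reference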
I don't know if this is the correct way to compare timings, but here it is.

Copy via left join:

microbenchmark(c = b[a, on = .(id)])
Unit: milliseconds
 expr      min      lq     mean   median       uq      max neval
    c 120.5911 122.734 128.1071 124.0922 126.1449 179.0127   100
Update join:

microbenchmark(a[b, (cols) := mget(cols), on = .(id)])
Unit: milliseconds
                                       expr      min       lq     mean   median       uq      max neval
 a[b, `:=`((cols), mget(cols)), on = .(id)] 159.6128 162.9798 175.9855 168.5807 184.8756 308.5406   100
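One caveat about timing the update join this way: the first run adds the t column to a, so every later iteration re-assigns an existing column rather than creating a new one, which is not quite the same operation. A sketch that rebuilds a before each timing, assuming a microbenchmark version that supports the setup argument (evaluated before each run and excluded from the measurement):

library(data.table)
library(microbenchmark)

b = data.table(id = 1:10^6, t = 'y')
cols = "t"

microbenchmark(
  update = a[b, (cols) := mget(cols), on = .(id)],
  setup  = { a <- data.table(id = 1:10^6) },  # fresh a before every run
  times  = 100
)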
Although I will eventually need a solution that updates multiple columns, I don't know why, but using mget is slower than the single-column approach below.
microbenchmark(a[b, t := i.t, on = .(id)])
Unit: milliseconds
                           expr      min       lq     mean   median      uq      max neval
 a[b, `:=`(t, i.t), on = .(id)] 138.0173 140.4671 146.3558 141.6204 143.281 185.1488   100
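For the multi-column case, one variant worth trying (a sketch only; I have not profiled it against mget(cols) at this scale) is to make the i. prefix explicit inside mget, so the values are read straight from b's columns during the join:

library(data.table)

a = data.table(id = 1:10^6)
b = data.table(id = 1:10^6, t = 'y', u = 'z')  # u added just to illustrate several columns
cols = setdiff(names(b), names(a))

# explicit i. prefix: take t and u directly from b
a[b, (cols) := mget(paste0("i.", cols)), on = .(id)]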
In this case I'm not concerned about memory usage, just speed. Is there a way to update the original table in place, without copies, that is faster than the copy approach?