This is related to this question (Can I access repeated column names in `j` in a data.table join?), which was asked because I assumed that the opposite of this was true.
data.table with just 2 columns:
Suppose you wish to join two data.tables and then perform a simple operation on two joined columns. This can be done in either one or two calls to `.[`:
library(data.table)
N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name) # key DT1 so that DT1[DT2] joins on name
system.time({x = DT1[DT2, value1 - value]}) # One step: join and subtract together
system.time({x = DT1[DT2][, value1 - value]}) # Two steps: join, then subtract
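Before reading anything into the timings, it may be worth checking that the two forms really compute the same thing. The following is only a sketch: it uses a smaller table, the names `A`, `B`, `one_step` and `two_step` are made up for illustration, and it assumes a data.table version in which both expressions return a plain numeric vector.

```r
library(data.table)

set.seed(1)   # fixed seed, only so the check is repeatable
n <- 1e5      # smaller than N above, just for a quick check

A <- data.table(name = 1:n, value = rnorm(n))
B <- data.table(name = 1:n, value1 = rnorm(n))
setkey(A, name)

one_step <- A[B, value1 - value]     # join and subtract in a single call
two_step <- A[B][, value1 - value]   # materialise the join, then subtract

# Both results should be numeric vectors of length n holding the same values
stopifnot(is.numeric(one_step), length(one_step) == n)
stopifnot(isTRUE(all.equal(one_step, two_step)))
```

If the two results differ in shape on a given installation, the timings are not comparing like with like.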
It turns out that making two calls (doing the join first, and then doing the subtraction) is noticeably quicker than doing everything in one go, roughly 0.16 s versus 0.67 s elapsed here:
> system.time({x = DT1[DT2, value1 - value]})
   user  system elapsed
   0.67    0.00    0.67
> system.time({x = DT1[DT2][, value1 - value]})
   user  system elapsed
   0.14    0.01    0.16
Why is this?
data.table with many columns:
If you put a LOT of columns into the data.tables, then you do eventually find that the one-step approach is quicker, presumably because data.table only uses the columns you reference in `j`.
N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]})
system.time({x = DT1[DT2][, value1 - value]})
> system.time({x = DT1[DT2, value1 - value]})
   user  system elapsed
   0.89    0.02    0.90
> system.time({x = DT1[DT2][, value1 - value]})
   user  system elapsed
   1.64    0.16    1.81
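One way to probe the presumption about unused columns is to trim both tables down to the columns that are actually needed before doing the two-step join, and then time that variant as well. This is a rough sketch rather than a definitive test: the names `wide1`, `wide2`, `narrow1`, `narrow2` and `trimmed` are hypothetical, the tables are smaller than above, and it assumes a data.table version that supports the `on =` argument.

```r
library(data.table)

n <- 1e5
wide1 <- data.table(name = 1:n, value = rnorm(n))[, (letters) := pi][, (LETTERS) := pi]
wide2 <- data.table(name = 1:n, value1 = rnorm(n))[, (letters) := pi][, (LETTERS) := pi]
setkey(wide1, name)

one_step <- wide1[wide2, value1 - value]     # j only refers to value and value1
two_step <- wide1[wide2][, value1 - value]   # the join builds every column first

# Two-step again, but on copies holding only the columns that are used
narrow1 <- wide1[, .(name, value)]
narrow2 <- wide2[, .(name, value1)]
trimmed <- narrow1[narrow2, on = "name"][, value1 - value]

# All three variants should agree on the values
stopifnot(isTRUE(all.equal(one_step, two_step)))
stopifnot(isTRUE(all.equal(one_step, trimmed)))
```

Wrapping each of the three expressions in `system.time()` would show whether the extra cost of the two-step form goes away once the unused columns are dropped, which is what the presumption would predict.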