
This is related to this question (Can I access repeated column names in `j` in a data.table join?), which was asked because I assumed the opposite of what is shown here was true.

data.table with just 2 columns:

Suppose you wish to join two data.tables and then perform a simple operation on two joined columns. This can be done in either one or two calls to .[:

library(data.table)

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)

system.time({x = DT1[DT2, value1 - value]})     # One Step

system.time({x = DT1[DT2][, value1 - value]})   # Two Step

It turns out that making two calls - doing the join first, and then doing the subtraction - is noticeably quicker than all in one go.

> system.time({x = DT1[DT2, value1 - value]})
   user  system elapsed 
   0.67    0.00    0.67 
> system.time({x = DT1[DT2][, value1 - value]})
   user  system elapsed 
   0.14    0.01    0.16 

Why is this?

data.table with many columns:

If you put a lot of columns into the data.table then you do eventually find that the one-step approach is quicker - presumably because data.table only uses the columns you reference in j.

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]})
system.time({x = DT1[DT2][, value1 - value]})

> system.time({x = DT1[DT2, value1 - value]})
   user  system elapsed 
   0.89    0.02    0.90 
> system.time({x = DT1[DT2][, value1 - value]})
   user  system elapsed 
   1.64    0.16    1.81 
Corvus
  • I changed the title to something more direct. Hope it's alright. – Arun Jul 18 '13 at 09:30
  • I just want to confirm that you realize (not obvious from OP) that there is a hidden `by` in the first case and the two expressions give very different results in general – eddi Jul 18 '13 at 12:59
  • the two *types of expressions – eddi Jul 18 '13 at 13:13
  • @eddi I don't think I did realise - care to expand on that? – Corvus Jul 18 '13 at 13:20
  • See [this post](http://stackoverflow.com/questions/16093289/data-table-join-and-j-expression-unexpected-behavior) and the follow up [mailing list discussion](http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-quot-by-quot-td4664770.html). Expectation is that this behavior will become explicit in the future making the difference obvious. – eddi Jul 18 '13 at 14:33
  • The mailing list in particular has a couple of really good examples from @MatthewDowle highlighting the difference. – eddi Jul 18 '13 at 14:43
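To make the hidden `by` that @eddi mentions concrete: the two forms are not equivalent when the join keys are non-unique. A minimal sketch (using by = .EACHI, the explicit spelling that current data.table versions require for the old implicit grouping), with hypothetical toy tables dt1/dt2:

```r
library(data.table)

dt1 <- data.table(name = c(1L, 1L, 2L), value = c(10, 20, 30))
dt2 <- data.table(name = c(1L, 2L))
setkey(dt1, name)

# One-step form: j is evaluated once per row of i (the hidden 'by')
dt1[dt2, sum(value), by = .EACHI]   # per-name sums: 30 and 30

# Two-step form: j is evaluated once over the whole join result
dt1[dt2][, sum(value)]              # one grand total: 60
```

With unique keys, as in the benchmark above, both forms happen to give the same numbers - but the evaluation strategies still differ, which is what the timings reflect.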

1 Answer


I think this is due to the repeated subsetting that DT1[DT2, value1-value] makes for every name in DT2. That is, you have to perform a j operation for each i here, as opposed to just one j operation after the join. This becomes quite costly with 1e6 unique entries - the overhead of [.data.table becomes significant and noticeable.

DT1[DT2][, value1-value] # similar to rowSums
DT1[DT2, value1-value]

In the first case, DT1[DT2], you perform the join first, and it is really fast. Of course, with more columns, as you show, you'll see a difference - but the point is that the join is performed only once. In the second case, you're grouping DT1 by DT2's name and computing the difference for every one of them. That is, you're subsetting DT1 for each value of DT2 - one j operation per subset! You can see this better by just profiling it:

Rprof()
t1 <- DT1[DT2, value1-value]
Rprof(NULL)
summaryRprof()

# $by.self
#                self.time self.pct total.time total.pct
# "[.data.table"      0.96    97.96       0.98    100.00
# "-"                 0.02     2.04       0.02      2.04

Rprof()
t2 <- DT1[DT2][, value1-value]
Rprof(NULL)
summaryRprof()

# $by.self
#                self.time self.pct total.time total.pct
# "[.data.table"      0.22    84.62       0.26    100.00
# "-"                 0.02     7.69       0.02      7.69
# "is.unsorted"       0.02     7.69       0.02      7.69

This overhead from repeated subsetting seems to be overcome when you have many columns, at which point the join on many columns overtakes it as the time-consuming operation. You can probably check this out yourself by profiling the other code in the same way.
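On current data.table versions the implicit grouping has to be requested explicitly with by = .EACHI, so a self-contained sketch of the two strategies (smaller N than the question's 1e6, names x1/x2 are my own) that also confirms they produce the same numbers when keys are unique:

```r
library(data.table)

N <- 1000
DT1 <- data.table(name = 1:N, value = rnorm(N))
DT2 <- data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)

# by = .EACHI: j is evaluated once per row of DT2 (N groups),
# the explicit form of the old implicit grouping discussed above
x1 <- DT1[DT2, value1 - value, by = .EACHI]$V1

# Two-step: one join, then a single j evaluation over the result
x2 <- DT1[DT2][, value1 - value]

stopifnot(isTRUE(all.equal(x1, x2)))  # same numbers; the cost profile differs
```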

Arun