
I am extremely impressed by how much faster tapply-like operations are with data.table than with data frames.

For example:

library(data.table)

df = data.frame(class = round(runif(1e6,1,1000)), x=rnorm(1e6))
DT = data.table(df)

# takes ages if somefun is complex
res1 = tapply(df$x, df$class, somefun) 

# much faster
setkey(DT, class)
res2 = DT[,somefun(x),by=class] 

However, I haven't managed to make data.table noticeably faster than data frames in apply-like operations (i.e., cases in which a function needs to be applied to each row).

df = data.frame(x1 = rnorm(1e6), x2=rnorm(1e6))
DT = data.table(df)

# takes ages if somefun is complex
res1 = apply(df, 1, somefun) 

# not much improvement, if at all 
DT[,rowid:=.I] # or: DT$rowid = 1:nrow(DT)
setkey(DT, rowid)
res2 = DT[,somefun1(x1,x2),by=rowid] 

Is this just to be expected, or are there some tricks?

msp
  • `apply` will coerce to a matrix internally (copies are made, and all columns must be of the same type). `data.table` has a lot of overhead -- you really want to vectorize somefun – mnel May 24 '13 at 06:29
  • I hope you'll have some fun vectorizing somefun. – juba May 24 '13 at 07:38
  • Unfortunately, I can only imagine "vectorizing" it by iterating over the row number rowid and then running a parallel computation for df[rowid,] in mclapply. This does the job, but is quite resource-intensive. – msp May 24 '13 at 08:04
  • This is not reproducible, but even with what is shown we can see that in the second half of the post the two pieces of code are not comparable -- in the computation of `res1` one argument is passed to `somefun`, and in the case of `res2` two arguments are passed. – G. Grothendieck May 24 '13 at 12:17
  • Yes, indeed somefun will have to be written a bit differently in either case, but I have the impression others understood what I meant. Have changed somefun to somefun1 in the last example though to avoid the confusion. And yes, mnel has opened up the whole new world to me with vectorization, as I'm now realising - thanks a lot! – msp May 24 '13 at 14:52
  • This might be a crazy notion but would there be a way to reshape/transpose the data first? – Dean MacGregor May 30 '13 at 12:37
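
The vectorization suggested in the comments can be sketched as follows. This is only an illustration: the body of `somefun1` is a made-up example (the original function is not shown), and the approach assumes the real function can be expressed with operations that work on whole columns.

```r
library(data.table)

set.seed(1)
n  <- 1e4
DT <- data.table(x1 = rnorm(n), x2 = rnorm(n))

# Hypothetical stand-in for the poster's function; it only uses
# arithmetic that works element-wise on whole vectors.
somefun1 <- function(a, b) sqrt(a^2 + b^2)

# Row by row: one function call per group, i.e. per row.
DT[, rowid := .I]
res_by <- DT[, somefun1(x1, x2), by = rowid]$V1

# Vectorized: a single call on the full columns, no grouping at all.
res_vec <- DT[, somefun1(x1, x2)]
```

Both versions give the same values; the second avoids the per-group overhead that makes `by = rowid` slow on large tables.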

2 Answers


If you cannot vectorize your function (because of recursion, for example), then you are in Rcpp territory. The usual recipe for combining Rcpp and data.table is:

  1. shape your data.table accordingly (setkey...)
  2. write your C/C++ function, say f, that takes a Rcpp::DataFrame and returns a Rcpp::List
  3. update by reference: cppOutList <- f(DT), then DT[, names(cppOutList) := cppOutList]

Doing this usually saves orders of magnitude in run time.
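
A minimal sketch of the three steps above, with one loud assumption: `f` here is a plain R stand-in, not a compiled function. In the recipe it would be written in C++ and exported through Rcpp; the stand-in only mimics the interface (takes the table, returns a named list of equal-length columns). The column names `cumx` and `lagx` are invented for illustration.

```r
library(data.table)

DT <- data.table(id = c(2L, 1L, 3L), x = c(10, 20, 30))

# Step 1: shape the table; setkey() sorts it by id in place.
setkey(DT, id)

# Step 2 (stand-in): a pure-R placeholder for the Rcpp function.
# It returns a named list with one vector per new column.
f <- function(d) {
  list(
    cumx = cumsum(d$x),                  # running total, inherently sequential
    lagx = c(NA_real_, head(d$x, -1))    # previous row's value
  )
}

# Step 3: add all returned columns to DT by reference.
cppOutList <- f(DT)
DT[, names(cppOutList) := cppOutList]
```

Because `:=` modifies `DT` in place, the new columns are attached without copying the existing ones.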

statquant

You can probably speed up row-wise operations considerably using set(). There is a good benchmark here: Row operations in data.table using `by = .I`
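
As a rough illustration of the `set()` approach (not taken from the linked benchmark), the loop below fills a pre-allocated column one row at a time. The function `somefun1` is a hypothetical scalar function invented for this sketch; whether this beats `by = rowid` for a given function is something to measure, not assume.

```r
library(data.table)

set.seed(1)
n  <- 1000L
DT <- data.table(x1 = rnorm(n), x2 = rnorm(n))

# Hypothetical function that handles one row's values at a time.
somefun1 <- function(a, b) max(a, b) - min(a, b)

# Pre-allocate the result column, then pull the inputs out once.
DT[, res := NA_real_]
x1 <- DT$x1
x2 <- DT$x2

# set() assigns by reference and skips the overhead of `[.data.table`,
# which matters when it is called once per row.
for (i in seq_len(n)) {
  set(DT, i = i, j = "res", value = somefun1(x1[i], x2[i]))
}
```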

Community
rafa.pereira