I am extremely impressed with how much faster tapply-like operations run with data.table than with data frames.
For example:
library(data.table)
df = data.frame(class = round(runif(1e6,1,1000)), x=rnorm(1e6))
DT = data.table(df)
# takes ages if somefun is complex
res1 = tapply(df$x, df$class, somefun)
# much faster
setkey(DT, class)
res2 = DT[,somefun(x),by=class]
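To make the grouped comparison reproducible, here is a minimal sketch. It assumes a hypothetical stand-in for somefun (the range width of each group) and a smaller n than above so it runs quickly; the timings it reports will depend on the machine and on how expensive the real function is.

```r
library(data.table)

set.seed(1)
n <- 1e5  # smaller than the 1e6 used above, only to keep the sketch quick

# Hypothetical stand-in for somefun: any function returning one value per group
somefun <- function(v) diff(range(v))

df <- data.frame(class = round(runif(n, 1, 1000)), x = rnorm(n))
DT <- data.table(df)

# Data frame version
t_df <- system.time(res1 <- tapply(df$x, df$class, somefun))

# data.table version: key on class, then group by it
setkey(DT, class)
t_dt <- system.time(res2 <- DT[, somefun(x), by = class])

# Both results are ordered by class, so they can be compared directly
same <- isTRUE(all.equal(as.numeric(res1), res2$V1))
print(same)
print(rbind(data.frame = t_df, data.table = t_dt))
```

The unnamed result of `somefun(x)` ends up in a column called `V1`, which is why the comparison uses `res2$V1`.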
However, I have not managed to make it noticeably faster than data frames in apply-like operations (i.e., cases in which a function needs to be applied to each row).
df = data.frame(x1 = rnorm(1e6), x2=rnorm(1e6))
DT = data.table(df)
# takes ages if somefun is complex
res1 = apply(df, 1, somefun)
# not much improvement, if at all
DT[,rowid:=.I] # or: DT$rowid = 1:nrow(DT)
setkey(DT, rowid)
res2 = DT[,somefun1(x1,x2),by=rowid]
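A self-contained sketch of the row-wise comparison follows. The bodies of somefun and somefun1 are hypothetical stand-ins (the real functions are more complex), and n is reduced so the sketch finishes quickly.

```r
library(data.table)

set.seed(1)
n <- 1e4  # reduced from 1e6 so the per-row loops finish quickly

# Hypothetical stand-ins: somefun receives a whole row as a vector,
# somefun1 receives the two column values as separate arguments
somefun  <- function(r) r[1] * r[2] + 1
somefun1 <- function(a, b) a * b + 1

df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
DT <- data.table(df)

# Data frame version: one call to somefun per row
t_df <- system.time(res1 <- apply(df, 1, somefun))

# data.table version: one group per row, so still one call per row
DT[, rowid := .I]
setkey(DT, rowid)
t_dt <- system.time(res2 <- DT[, somefun1(x1, x2), by = rowid])

# The two approaches should agree on the values
same <- isTRUE(all.equal(as.numeric(res1), res2$V1))
print(same)
print(rbind(data.frame = t_df, data.table = t_dt))
```

Both versions call the function once per row, so this only measures the overhead around those calls, not the cost of the function itself.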
Is this just to be expected, or are there some tricks?