This is a follow up to my previous question: How to extract first n rows per group and calculate function using that subset?
Another relevant post as well: How to extract the first n rows per group?
I have the following data:
set.seed(1)
dt1 <- data.table(ticker="aa",letters=sample(LETTERS,10^6,T),x=rnorm(2000,100,10),y=rnorm(2000,80,20))
dt2 <- data.table(ticker="aapl",letters=sample(LETTERS,10^6,T),x=rnorm(2000,100,10),y=rnorm(2000,80,20))
dt3 <- data.table(ticker="abc",letters=sample(LETTERS,10^6,T),x=rnorm(2000,100,10),y=rnorm(2000,80,20))
myList <- list(dt1,dt2,dt3)
I want to apply a function to this data at specific indexes by group where the function output depends on the subsetted dataframe. I then want to group the resulting data.table by a different grouping variable and take a simple mean.
Do I want to calculate my function by group1 on the subsetted rows first, rbindlist the results, then calculate mean by group2?
Or do I want to rbindlist my entire data first, pre-select subsetted rows, then calculate my function by group1 then calculate means by group2?
# data.table version of function
dt_calc_perf <- function(dt){
buy <- ifelse(dt$x > mean(dt$y),1,0)
dt$perf <- buy*(dt$x/dt$y-1)
return(dt)
}
# vector return version of function
calc_perf <- function(dt){
buy <- ifelse(dt$x > mean(dt$y),1,0)
perf <- buy*(dt$x/dt$y-1)
return(perf)
}
# which is faster?
# method 1
method1 <- function(){
res1 <- rbindlist(lapply(1:length(myList),
function(m) dt_calc_perf(myList[[m]][1:1000])))
res1 <- res1[,list('perf'=mean(perf),'tickers'=paste(ticker,collapse=',')),
by=letters]
}
# method 2
dt <- rbindlist(myList)
x <- dt[dt[,.I[1:1000],by=ticker]$V1]
method2 <- function(){
res2 <- x[,list('letters'=letters,'perf'= calc_perf(.SD)),by=ticker]
res2 <- res2[,list('perf'=mean(perf),'tickers'=paste(ticker,collapse=',')),
by=letters]
}
all.equal(method1(),method2())
[1] TRUE
with length(myList) = 3:
microbenchmark(method1(),method2())
Unit: milliseconds
expr min lq mean median uq max neval
method1() 2.874678 2.976673 3.181134 3.031414 3.103259 10.266646 100
method2() 3.008534 3.150086 3.352862 3.215517 3.292495 9.901859 100
with length(myList) = 12:
> myList <- list(dt1,dt2,dt3,dt1,dt2,dt3,dt1,dt2,dt3,dt1,dt2,dt3)
> microbenchmark(method1(),method2())
Unit: milliseconds
expr min lq mean median uq max neval
method1() 9.284757 9.655745 10.346527 9.786392 10.016470 17.044078 100
method2() 3.020508 3.176173 3.330252 3.239680 3.322644 9.895444 100
EDIT:::
One thing to note is that my method
function is eventually going to be fed into a genetic optimization algorithm where method
will be called many-many times. My goal is to be able to compute calc_perf
(which in reality is way more complex: inputs dt
outputs vector perf
) by subset and ticker
. And then group that resulting dt
by letters
and calculate mean(perf)
.