
This is a follow-up to my previous question: How to extract first n rows per group and calculate function using that subset?

Another relevant post: How to extract the first n rows per group?

I have the following data:

library(data.table)

set.seed(1)
# note: x and y are length 2000 and get recycled across the 10^6 rows
dt1 <- data.table(ticker="aa",letters=sample(LETTERS,10^6,T),x=rnorm(2000,100,10),y=rnorm(2000,80,20))
dt2 <- data.table(ticker="aapl",letters=sample(LETTERS,10^6,T),x=rnorm(2000,100,10),y=rnorm(2000,80,20))
dt3 <- data.table(ticker="abc",letters=sample(LETTERS,10^6,T),x=rnorm(2000,100,10),y=rnorm(2000,80,20))
myList <- list(dt1,dt2,dt3)

I want to apply a function to the first n rows of each group, where the function's output depends on the subsetted data. I then want to group the resulting data.table by a different grouping variable and take a simple mean.

Should I calculate my function by the first grouping variable (ticker) on the subsetted rows, rbindlist the results, and then calculate the mean by the second grouping variable (letters)?

Or should I rbindlist all of the data first, pre-select the subsetted rows, calculate my function by ticker, and then calculate the means by letters?

# data.table version of the function: returns dt with a perf column added
dt_calc_perf <- function(dt){
  # buy signal: 1 when x exceeds the mean of y, else 0
  buy <- ifelse(dt$x > mean(dt$y),1,0)
  dt$perf <- buy*(dt$x/dt$y-1)
  return(dt)
}
# vector-returning version of the same function
calc_perf <- function(dt){
  buy <- ifelse(dt$x > mean(dt$y),1,0)
  perf <- buy*(dt$x/dt$y-1)
  return(perf)
}
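
As a quick sanity check, here is what both functions do on a tiny made-up table (the values are chosen only for illustration):

toy <- data.table(ticker = "aa", letters = c("A", "B", "C"),
                  x = c(110, 90, 120), y = c(100, 100, 100))
# mean(toy$y) is 100, so rows with x > 100 get buy = 1
dt_calc_perf(toy)  # adds a perf column: 0.10, 0.00, 0.20
calc_perf(toy)     # returns the same values as a bare vector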

# which is faster?

# method 1
method1 <- function(){
  res1 <- rbindlist(lapply(seq_along(myList),
                           function(m) dt_calc_perf(myList[[m]][1:1000])))
  res1[,list('perf'=mean(perf),'tickers'=paste(ticker,collapse=',')),
       by=letters]
}

# method 2
dt <- rbindlist(myList)
x <- dt[dt[,.I[1:1000],by=ticker]$V1]

method2 <- function(){
  res2 <- x[,list('letters'=letters,'perf'= calc_perf(.SD)),by=ticker]
  res2[,list('perf'=mean(perf),'tickers'=paste(ticker,collapse=',')),
       by=letters]
}
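
(For anyone unfamiliar with the .I idiom in method 2: dt[, .I[1:n], by = ticker]$V1 collects the global row numbers of the first n rows of each ticker, which are then used to subset dt in one shot. A minimal illustration on a made-up two-ticker table:)

demo <- data.table(ticker = rep(c("aa", "abc"), each = 4), v = 1:8)
demo[, .I[1:2], by = ticker]$V1        # 1 2 5 6 (global row indices)
demo[demo[, .I[1:2], by = ticker]$V1]  # first 2 rows of each ticker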

> all.equal(method1(),method2())
[1] TRUE

with length(myList) = 3:

> library(microbenchmark)
> microbenchmark(method1(),method2())
Unit: milliseconds
      expr      min       lq     mean   median       uq       max neval
 method1() 2.874678 2.976673 3.181134 3.031414 3.103259 10.266646   100
 method2() 3.008534 3.150086 3.352862 3.215517 3.292495  9.901859   100

with length(myList) = 12:

> myList <- list(dt1,dt2,dt3,dt1,dt2,dt3,dt1,dt2,dt3,dt1,dt2,dt3)
> microbenchmark(method1(),method2())
Unit: milliseconds
      expr      min       lq      mean   median        uq       max neval
 method1() 9.284757 9.655745 10.346527 9.786392 10.016470 17.044078   100
 method2() 3.020508 3.176173  3.330252 3.239680  3.322644  9.895444   100

EDIT:

One thing to note is that my method function will eventually be fed into a genetic optimization algorithm, where it will be called many, many times. My goal is to compute calc_perf (which in reality is far more complex: it takes dt as input and outputs the vector perf) by subset and ticker, and then group the resulting data.table by letters and calculate mean(perf).
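
Concretely, the pattern I am after looks something like this (a minimal sketch; precomputed and hot_path are placeholder names, with the one-time setup kept outside the optimizer's loop):

# one-time setup, outside the optimizer's hot path
precomputed <- rbindlist(myList)
precomputed <- precomputed[precomputed[, .I[1:1000], by = ticker]$V1]

# called many times by the genetic algorithm; does only the per-ticker
# calculation and the per-letters aggregation
hot_path <- function(d = precomputed) {
  res <- d[, list('letters' = letters, 'perf' = calc_perf(.SD)), by = ticker]
  res[, list('perf' = mean(perf),
             'tickers' = paste(ticker, collapse = ',')),
      by = letters]
}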


1 Answer


Firstly, I think the subset size should be increased for the benchmarks, so the bottlenecks show up more clearly:

sn <- 100000

Secondly, when benchmarking, the rbindlist step should be included in method2, so:

method2 <- function() {
  dt <- rbindlist(myList)
  x <- dt[dt[, .I[1:sn], by = ticker]$V1]
  # .SD already holds exactly sn rows per ticker here, so [1:sn] is a no-op
  res2 <- x[, list('letters' = letters, 'perf' = calc_perf(.SD[1:sn])),
            by = ticker]
  res2[, list('perf' = mean(perf),
              'tickers' = paste(ticker, collapse = ',')),
       by = letters]
}

My method is similar to method1, but with a different implementation of the performance calculation:

method3 <- function() {
  require(hutils)  # for if_else(), a type-strict alternative to base ifelse()
  dl <- lapply(myList, function(x) {
    # := adds perf by reference; x[1:sn] takes a copy first, so myList is untouched
    x[1:sn][, perf := if_else(x > mean(y), x/y - 1, 0)]
  })
  x <- rbindlist(dl)
  x[, list('perf' = mean(perf),
           'tickers' = paste(ticker, collapse = ',')),
    by = letters]
}
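
If you want to avoid the hutils dependency, newer versions of data.table ship their own fifelse, which should behave the same way here; a minimal sketch (method3b is just an illustrative name):

method3b <- function() {
  dl <- lapply(myList, function(d) {
    d[1:sn][, perf := fifelse(x > mean(y), x / y - 1, 0)]
  })
  x <- rbindlist(dl)
  x[, list('perf' = mean(perf),
           'tickers' = paste(ticker, collapse = ',')),
    by = letters]
}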

Benchmarks:

# for data creation:
createData <- function(x) {
  data.table(ticker = as.character(x), letters = sample(LETTERS, 10 ^ 6, T),
             x = rnorm(2000, 100, 10), y = rnorm(2000, 80, 20))
}
# create a larger list:
set.seed(12)
myList <- lapply(1:40, createData)

system.time(r1 <- method1()) # 1.84 - 2.55
system.time(r2 <- method2()) # 3.76 - 5.59
system.time(r3 <- method3()) # 1.46 - 1.62

all.equal(r1, r2) # TRUE
all.equal(r1, r3) # TRUE
  • I agree that increasing the subset size is a good change for seeing the bottleneck better. However, I do not want to include `rbindlist` in `method2`. My `method` function is eventually going to be fed into a genetic algorithm optimization. I am trying to keep `method` as lean as possible, because the optimization makes an enormous number of function calls. I will edit the question accordingly – road_to_quantdom Oct 25 '18 at 13:12
  • @road_to_quantdom how big is your real data? Will you be able to `rbindlist` all of it? How complex is the real `calc_perf`? Maybe that function simply needs optimization? – minem Oct 25 '18 at 13:22
  • The `rbindlist`ed data is 74 million rows. This is still a subset of the larger data (although sufficient on its own); I am not sure whether I will be able to `rbindlist` the entire data. `calc_perf` requires the creation of 3 column vectors, plus information from 4 other columns in `dt`. – road_to_quantdom Oct 25 '18 at 13:44
  • It seems as though changing my function to use `:=` is the best approach to improve speed. It just feels strange to change a function that takes `dt` as input into one that takes a list of so many column-vector names – road_to_quantdom Oct 25 '18 at 13:45
  • @road_to_quantdom maybe you can create a separate question for optimizing the `calc_perf` function, supplying the real code? It would also help if you made your example as close to reality as possible, showing the real operations you are doing – minem Oct 25 '18 at 13:54