
This is a follow-up to my previous question: How to extract first n rows per group and calculate function using that subset?

Another relevant post: How to extract the first n rows per group?

I have the following data:

library(data.table)

set.seed(1)
# note: x and y are length 2000 and get recycled across the 10^6 rows
dt1 <- data.table(ticker="aa",letters=sample(LETTERS,10^6,T),x=rnorm(2000,100,10),y=rnorm(2000,80,20))
dt2 <- data.table(ticker="aapl",letters=sample(LETTERS,10^6,T),x=rnorm(2000,100,10),y=rnorm(2000,80,20))
dt3 <- data.table(ticker="abc",letters=sample(LETTERS,10^6,T),x=rnorm(2000,100,10),y=rnorm(2000,80,20))
myList <- list(dt1,dt2,dt3)

I want to apply a function to the first n rows of each group, where the function's output depends on the subsetted data. I then want to group the resulting data.table by a different grouping variable and take a simple mean.

Should I calculate my function by the first grouping variable (ticker) on the subsetted rows, rbindlist the results, and then calculate the mean by the second grouping variable (letters)?

Or should I rbindlist all of the data first, pre-select the subsetted rows, calculate my function by ticker, and then calculate the means by letters?

# data.table version of the function: returns dt with a perf column added
dt_calc_perf <- function(dt){
  # buy signal: 1 when x exceeds the mean of y, else 0
  buy <- ifelse(dt$x > mean(dt$y),1,0)
  dt$perf <- buy*(dt$x/dt$y-1)
  return(dt)
}
# vector-returning version of the same function
calc_perf <- function(dt){
  buy <- ifelse(dt$x > mean(dt$y),1,0)
  perf <- buy*(dt$x/dt$y-1)
  return(perf)
}
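
As a quick sanity check, here is what both functions do on a tiny made-up table (the values are chosen only for illustration):

toy <- data.table(ticker = "aa", letters = c("A", "B", "C"),
                  x = c(110, 90, 120), y = c(100, 100, 100))
# mean(toy$y) is 100, so rows with x > 100 get buy = 1
dt_calc_perf(toy)  # adds a perf column: 0.10, 0.00, 0.20
calc_perf(toy)     # returns the same values as a bare vector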

# which is faster?

# method 1
method1 <- function(){
  res1 <- rbindlist(lapply(seq_along(myList),
                           function(m) dt_calc_perf(myList[[m]][1:1000])))
  res1[,list('perf'=mean(perf),'tickers'=paste(ticker,collapse=',')),
       by=letters]
}

# method 2
dt <- rbindlist(myList)
x <- dt[dt[,.I[1:1000],by=ticker]$V1]

method2 <- function(){
  res2 <- x[,list('letters'=letters,'perf'= calc_perf(.SD)),by=ticker]
  res2[,list('perf'=mean(perf),'tickers'=paste(ticker,collapse=',')),
       by=letters]
}
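
(For anyone unfamiliar with the .I idiom in method 2: dt[, .I[1:n], by = ticker]$V1 collects the global row numbers of the first n rows of each ticker, which are then used to subset dt in one shot. A minimal illustration on a made-up two-ticker table:)

demo <- data.table(ticker = rep(c("aa", "abc"), each = 4), v = 1:8)
demo[, .I[1:2], by = ticker]$V1        # 1 2 5 6 (global row indices)
demo[demo[, .I[1:2], by = ticker]$V1]  # first 2 rows of each ticker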

> all.equal(method1(),method2())
[1] TRUE

with length(myList) = 3:

> library(microbenchmark)
> microbenchmark(method1(),method2())
Unit: milliseconds
      expr      min       lq     mean   median       uq       max neval
 method1() 2.874678 2.976673 3.181134 3.031414 3.103259 10.266646   100
 method2() 3.008534 3.150086 3.352862 3.215517 3.292495  9.901859   100

with length(myList) = 12:

> myList <- list(dt1,dt2,dt3,dt1,dt2,dt3,dt1,dt2,dt3,dt1,dt2,dt3)
> microbenchmark(method1(),method2())
Unit: milliseconds
      expr      min       lq      mean   median        uq       max neval
 method1() 9.284757 9.655745 10.346527 9.786392 10.016470 17.044078   100
 method2() 3.020508 3.176173  3.330252 3.239680  3.322644  9.895444   100

EDIT:

One thing to note is that my method function will eventually be fed into a genetic optimization algorithm, where it will be called many, many times. My goal is to compute calc_perf (which in reality is far more complex: it takes dt as input and outputs the vector perf) by subset and ticker, and then group the resulting data.table by letters and calculate mean(perf).
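
Concretely, the pattern I am after looks something like this (a minimal sketch; precomputed and hot_path are placeholder names, with the one-time setup kept outside the optimizer's loop):

# one-time setup, outside the optimizer's hot path
precomputed <- rbindlist(myList)
precomputed <- precomputed[precomputed[, .I[1:1000], by = ticker]$V1]

# called many times by the genetic algorithm; does only the per-ticker
# calculation and the per-letters aggregation
hot_path <- function(d = precomputed) {
  res <- d[, list('letters' = letters, 'perf' = calc_perf(.SD)), by = ticker]
  res[, list('perf' = mean(perf),
             'tickers' = paste(ticker, collapse = ',')),
      by = letters]
}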


1 Answer


Firstly, I think the subset size should be increased for the benchmarks, so the bottlenecks show up more clearly:

sn <- 100000

Secondly, when benchmarking, the rbindlist step should be included in method2, so:

method2 <- function() {
  dt <- rbindlist(myList)
  x <- dt[dt[, .I[1:sn], by = ticker]$V1]
  # .SD already holds exactly sn rows per ticker here, so [1:sn] is a no-op
  res2 <- x[, list('letters' = letters, 'perf' = calc_perf(.SD[1:sn])),
            by = ticker]
  res2[, list('perf' = mean(perf),
              'tickers' = paste(ticker, collapse = ',')),
       by = letters]
}

My method is similar to method1, but with a different implementation of the performance calculation:

method3 <- function() {
  require(hutils)  # for if_else(), a type-strict alternative to base ifelse()
  dl <- lapply(myList, function(x) {
    # := adds perf by reference; x[1:sn] takes a copy first, so myList is untouched
    x[1:sn][, perf := if_else(x > mean(y), x/y - 1, 0)]
  })
  x <- rbindlist(dl)
  x[, list('perf' = mean(perf),
           'tickers' = paste(ticker, collapse = ',')),
    by = letters]
}
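
If you want to avoid the hutils dependency, newer versions of data.table ship their own fifelse, which should behave the same way here; a minimal sketch (method3b is just an illustrative name):

method3b <- function() {
  dl <- lapply(myList, function(d) {
    d[1:sn][, perf := fifelse(x > mean(y), x / y - 1, 0)]
  })
  x <- rbindlist(dl)
  x[, list('perf' = mean(perf),
           'tickers' = paste(ticker, collapse = ',')),
    by = letters]
}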

Benchmarks:

# for data creation:
createData <- function(x) {
  data.table(ticker = as.character(x), letters = sample(LETTERS, 10 ^ 6, T),
             x = rnorm(2000, 100, 10), y = rnorm(2000, 80, 20))
}
# create a larger list:
set.seed(12)
myList <- lapply(1:40, createData)

system.time(r1 <- method1()) # 1.84 - 2.55
system.time(r2 <- method2()) # 3.76 - 5.59
system.time(r3 <- method3()) # 1.46 - 1.62

all.equal(r1, r2) # TRUE
all.equal(r1, r3) # TRUE
  • I agree that increasing the subset size is a good change for seeing the bottleneck better. However, I do not want to include `rbindlist` in `method2`. My `method` function is eventually going to be fed into a genetic algorithm optimization. I am trying to keep `method` as lean as possible, because the optimization makes an enormous number of function calls. I will edit the question accordingly – road_to_quantdom Oct 25 '18 at 13:12
  • @road_to_quantdom how big is your real data? Will you be able to `rbindlist` all of it? How complex is the real `calc_perf`? Maybe that function simply needs optimization? – minem Oct 25 '18 at 13:22
  • The `rbindlist`ed data is 74 million rows. This is still a subset of the larger data (although sufficient on its own); I am not sure whether I will be able to `rbindlist` the entire data. `calc_perf` requires the creation of 3 column vectors, plus information from 4 other columns in `dt`. – road_to_quantdom Oct 25 '18 at 13:44
  • It seems as though changing my function to use `:=` is the best approach to improve speed. It just feels strange to change a function that takes `dt` as input into one that takes a list of so many column-vector names – road_to_quantdom Oct 25 '18 at 13:45
  • @road_to_quantdom maybe you can create a separate question for optimizing the `calc_perf` function, supplying the real code? It would also help if you made your example as close to reality as possible, showing the real operations you are doing – minem Oct 25 '18 at 13:54