I have the following function, which runs hundreds of times. This aggregation is the bottleneck in my code. Is it possible to make it faster using just data.table, or by rewriting the function with Rcpp?

  logit.gr <- function(DT){
    # Per group: col1 (a vector) times a scalar sum, so one output row per input row
    temp1 <- DT[, lapply(.SD, function(x) col1*sum(y*(x - sum(x*exp(col2))))),
                by = .(main_idx), .SDcols = c('col3','col4')]
    return(-colSums(temp1[, c('col3','col4'), with = FALSE]))
  }

where DT is

DT <- data.table(main_idx = c(rep('A',4), rep('B', 5)), col1 = runif(9), col2 = -2+runif(9), col3 = 1+runif(9), col4 = 1+runif(9), y = runif(9))
deepAgrawal
  • Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Apr 09 '18 at 17:09
  • Thank you. I just did it – deepAgrawal Apr 09 '18 at 17:28
  • Should it be `col3*sum(y*(x - sum(x*exp(col4))))` instead of `col1*sum(y*(x - sum(x*exp(col2))))`? – Sathish Apr 09 '18 at 18:00
  • Where are you using `theta` in your logic? – MKR Apr 09 '18 at 18:01
  • I was calling another function inside this function. I removed it since it was not the bottleneck. Forgot to remove theta. – deepAgrawal Apr 09 '18 at 18:04
  • @Sathish For the purpose of making it faster, it does not matter which columns you use. – deepAgrawal Apr 09 '18 at 18:07

1 Answer

I think one way to optimize it is:

  1. Move the sum into the function passed to lapply. That yields only one row per main_idx in the resulting data.table.
  2. Chain a second [ call to sum the columns col3 and col4.
library(data.table)
DT[, lapply(.SD, function(x) sum(col1*sum(y*(x - sum(x*exp(col2)))))), 
   by = .(main_idx), .SDcols = c('col3','col4')][
         ,.(col3 = -sum(col3), col4 = -sum(col4))]
#Result (values vary between runs, since the data is random and no seed is set)
#     col3      col4 
#0.7575290 0.2423651 
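
The two steps above work because col1 multiplies a per-group scalar, and because sum(y*(x - s)) with a scalar s equals sum(y*x) - s*sum(y). The following is a minimal sketch, not part of the original answer, that checks this equivalence on seeded data. The names logit_gr_orig and logit_gr_fast and the seed are assumptions introduced for illustration, and whether the rewritten form is actually faster on real data would need benchmarking.

```r
library(data.table)

set.seed(1)  # assumed seed, only so the comparison is reproducible
DT <- data.table(main_idx = c(rep('A', 4), rep('B', 5)),
                 col1 = runif(9), col2 = -2 + runif(9),
                 col3 = 1 + runif(9), col4 = 1 + runif(9), y = runif(9))

# Formulation from the question: one output row per input row
logit_gr_orig <- function(DT) {
  temp1 <- DT[, lapply(.SD, function(x) col1 * sum(y * (x - sum(x * exp(col2))))),
              by = .(main_idx), .SDcols = c('col3', 'col4')]
  -colSums(temp1[, c('col3', 'col4'), with = FALSE])
}

# Hypothetical rewrite: one row per group, inner sum expanded algebraically
logit_gr_fast <- function(DT) {
  agg <- DT[, lapply(.SD, function(x)
                sum(col1) * (sum(y * x) - sum(y) * sum(x * exp(col2)))),
            by = .(main_idx), .SDcols = c('col3', 'col4')]
  -colSums(agg[, c('col3', 'col4'), with = FALSE])
}

# Both versions should agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(logit_gr_orig(DT), logit_gr_fast(DT))))
```

To compare timings on data of realistic size, one option is to wrap each call in system.time() and repeat it as many times as the real workload does.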

Data:

DT <- data.table(main_idx = c(rep('A',4), rep('B', 5)), 
              col1 = runif(9), col2 = -2+runif(9), 
              col3 = 1+runif(9), col4 = 1+runif(9), y = runif(9))
MKR