Hi, I have some data in data.table format in R and I need to run a function on it.

Let's say I have a data.table called A with the columns "name", "height", and "weight".

I want to run a function, e.g. a linear regression, within data.table and store the slope coefficient and the RMSE in a results table.

A[, .(beta = lm(height ~ weight)$coefficients[2],
      RMSE = as.numeric(sqrt(crossprod(lm(height ~ weight)$residuals) /
                 (length(lm(height ~ weight)$residuals) -
                  (length(coef(lm(height ~ weight))) - 1))) * 100)),
  by = .(name)]

My question: is there a way to save the lm(height ~ weight) result as an object and then access that object's data, so that data.table doesn't need to run lm() four times per group?

The code above runs, but it is slow compared to using foreach to loop over "name", as I have millions of rows of data.

Thanks.

Gabriel
  • This does beg the question of "tidy" work (referencing much of the tidyverse) but with the speed/efficiencies of `data.table`. Interesting, I'll be looking for a good discussion/education on this! – r2evans Oct 31 '18 at 16:53
  • Currently, if I run lm() only to get the coefficients, it takes 2 seconds. If I run lm() four times to calculate the RMSE as well, it takes 12 seconds! I am so used to data.table syntax now, but as the tidyverse is growing so big I might need to learn both! – Gabriel Oct 31 '18 at 17:01
  • Related: [data.table: anonymous function in j](https://stackoverflow.com/questions/25898162/data-table-anonymous-function-in-j). Use `{ }`, an anonymous body in `j`. Fill it with whatever function you wish (e.g. `lm`!). Finally, wrap desired return variables in `list( )` (or the dot alias `.( )`). – Henrik Oct 31 '18 at 17:37
  • See the examples in the answer here: [Using data.table to create a column of regression coefficients](https://stackoverflow.com/a/13906196/1851712), e.g. the second where `{ }` is used and an 'auxiliary' object is first created, then columns returned in `list(...)`. – Henrik Oct 31 '18 at 17:48
  • Also `lm()` is slow-ish as it does too much -- `lm.fit()` is a simpler alternative but you may then have to compute your own residuals. And you probably don't want to run `lm()` multiple times just for convenience. – Dirk Eddelbuettel Oct 31 '18 at 17:54
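
The `lm.fit()` suggestion in the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the toy data and the names `X`, `fit`, and `fast` are made up here, and the RMSE formula is copied from the question as-is. `lm.fit()` takes a design matrix rather than a formula, so the intercept column has to be added by hand; its return value does include a `residuals` component.

```r
library(data.table)

set.seed(42)
# Hypothetical toy data standing in for the real table A
A <- data.table(name   = rep(c("a", "b"), each = 100),
                weight = runif(200, 50, 100))
A[, height := 100 + 0.8 * weight + rnorm(.N, sd = 5)]

fast <- A[, {
  X   <- cbind(1, weight)        # design matrix: intercept column + weight
  fit <- lm.fit(X, height)       # one low-level fit per group
  res <- fit$residuals
  list(
    BETA = unname(fit$coefficients[2]),
    # same RMSE formula as in the question
    RMSE = sqrt(sum(res^2) / (length(res) - (length(fit$coefficients) - 1))) * 100
  )
}, by = .(name)]

print(fast)
```

Whether this is actually faster than `lm()` on the real data would need to be measured; no timing is claimed here.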

1 Answer

By using an anonymous body, as suggested by Henrik, I was able to speed up the process!

A[, {
       # fit the model once per group and reuse it below
       model <- lm(height ~ weight)
       BETA <- model$coefficients[2]
       RMSE <- as.numeric(sqrt(crossprod(model$residuals) / (length(model$residuals) -
               (length(coef(model)) - 1))) * 100)

       list(BETA = BETA, RMSE = RMSE)
     },
  by = .(name)]

Apparently, an anonymous body (a braced { } expression in j) does not require a name; it is like "run once and forget". Inside this block, lm() is run once per group and the result is stored in an object.

We can then extract the required values from the model object, and finally list() lets j turn the extracted values into columns.
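
To check that the single-fit version gives the same numbers as the original four-call version, a small self-contained comparison might look like the sketch below. The toy data and the names `fit_once` and `fit_many` are invented for illustration; the real table will differ.

```r
library(data.table)

set.seed(1)
# Hypothetical toy data: 3 groups of 50 rows each
A <- data.table(name   = rep(c("a", "b", "c"), each = 50),
                weight = runif(150, 50, 100))
A[, height := 100 + 0.8 * weight + rnorm(.N, sd = 5)]

# One lm() call per group, reused for both outputs
fit_once <- A[, {
  model <- lm(height ~ weight)
  res   <- model$residuals
  list(
    BETA = unname(model$coefficients[2]),
    RMSE = as.numeric(sqrt(crossprod(res) /
             (length(res) - (length(coef(model)) - 1))) * 100)
  )
}, by = .(name)]

# Formulation from the question: lm() is called four times per group
fit_many <- A[, .(
  BETA = unname(lm(height ~ weight)$coefficients[2]),
  RMSE = as.numeric(sqrt(crossprod(lm(height ~ weight)$residuals) /
           (length(lm(height ~ weight)$residuals) -
            (length(coef(lm(height ~ weight))) - 1))) * 100)
), by = .(name)]

# Both tables should agree
stopifnot(isTRUE(all.equal(fit_once, fit_many)))
```

Wrapping each version in `system.time()` on the real data is a simple way to see how much the single fit saves.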

Many thanks!

Henrik
Gabriel