
Following some great questions like this one: "Why is using update on a lm inside a grouped data.table losing its model data?", I'm running a regression within a data.table and storing the result as follows:

library(data.table)
DT = data.table(iris)
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))), by = Species]

However, I would like the j output to be stored as lm objects, not as a data.table:

class(fit[Species=="setosa"])
# [1] "data.table" "data.frame"
# I would like fit to contain 3 lm objects, not data.tables!

My question is: how can I store 3 lm objects within fit, rather than 3 data.tables? I need this because I want to use fit later for out-of-sample prediction (using predict.lm).

For example, I would like to store within the data table an element of the following type:

model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = DT[Species == "setosa"])
class(model)
# [1] "lm"
# I would like the first element of fit to include model, i.e. the model output object
new_data <- DT  # just a toy example :) this isn't really the new data
predict(model, new_data)
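As a hedged sketch of what the example above is after: subsetting rows of `fit` always returns a data.table, but the list column itself can hold `lm` objects, and these can be pulled out with `[[`. The column name `model` and the variable names `setosa_model` and `preds` are illustrative assumptions, not part of the original code.

```r
library(data.table)

DT <- data.table(iris)

# One lm per Species, stored in a list column.
# The column name "model" is an illustrative choice (the original leaves it as V1).
fit <- DT[, list(model = list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
          by = Species]

# Subsetting rows still gives a data.table ...
row_class <- class(fit[Species == "setosa"])

# ... but selecting the list column and extracting its element gives the lm itself.
setosa_model <- fit[Species == "setosa", model][[1]]
model_class <- class(setosa_model)

# The extracted object can be passed to predict() like any other lm.
new_data <- DT  # toy stand-in for real out-of-sample data
preds <- predict(setosa_model, new_data)
```

The extra `[[1]]` matters: `fit[Species == "setosa", model]` on its own is a list of length one, not the model.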
Yehoshaphat Schellekens
  • The value that is stored in `V1` is **not** a `data.table`, but rather a `list` that does store `lm` objects. You should do `class(fit[Species=="setosa", V1])` instead of just `class(fit[Species=="setosa"])`, because the latter checks the whole data set rather than just `V1`. Also, you can check what `V1` stores by simply doing `fit[, lapply(V1, class)]`. Finally, you can easily use `predict` on the values in `V1`. Just do `fit[, lapply(V1, predict, new_data)]` – David Arenburg Aug 03 '16 at 10:11
  • I don't understand why you use `data.table` for the fits. Just use `nlme::lmList(Sepal.Length ~ Sepal.Width + Petal.Length | Species, data = iris)`. – Roland Aug 03 '16 at 10:14
  • @Roland, eventually I want to iterate over many subgroups. Currently I'm doing it within a for loop; hopefully data.table would do it faster – Yehoshaphat Schellekens Aug 03 '16 at 10:16
  • I don't see any speed advantage in using data.table here. The slow part is not the grouping. The time is mostly spent in `lm`. To really get better performance, you should use `lm.fit` (in combination with data.table), but then you of course lose the convenience of using `summary`, `predict` etc. and have to code these yourself. PS: Learn to profile your code. – Roland Aug 03 '16 at 10:19
  • @Roland thanks for the profiling tip. I've been using R for years, and this is the first time I've heard about this :) – Yehoshaphat Schellekens Aug 03 '16 at 11:31
  • Continuing on Roland's point, for this type of exercise, you might consider a parallel `lapply` followed by `rbindlist`. Send different keyed subgroups to different threads. – MichaelChirico Aug 03 '16 at 17:56
  • @MichaelChirico can you show me an example? I would love to parallelize that, and I'll consider it a valid answer :) – Yehoshaphat Schellekens Aug 04 '16 at 06:42
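The parallel `lapply` followed by `rbindlist` idea from the comments could look roughly like the sketch below. It is not taken from the thread: it assumes `parallel::mclapply` as the parallel `lapply`, a hard-coded two workers on non-Windows systems (forking is unavailable on Windows, so it falls back to one core there), and `split()` with `by =` from a reasonably recent data.table. The names `groups`, `fits`, `coefs` and `n_cores` are illustrative.

```r
library(data.table)
library(parallel)

DT <- data.table(iris)

# Split into one data.table per subgroup; the list is named by Species.
groups <- split(DT, by = "Species")

# mclapply forks worker processes; on Windows only mc.cores = 1 is supported.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Fit one model per subgroup, each potentially on a different worker.
fits <- mclapply(
  groups,
  function(d) lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = d),
  mc.cores = n_cores
)

# Collect the coefficients into a single table, one row per subgroup.
coefs <- rbindlist(lapply(fits, function(m) as.list(coef(m))), idcol = "Species")
```

`fits` remains a named list of full `lm` objects, so `predict(fits[["setosa"]], new_data)` still works. As noted in the comments, most of the time is spent inside `lm` itself, so whether the extra processes pay off is worth profiling on the real data.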

0 Answers