Apply grouped model back onto data

Question

I fit models like so

groupedTrainingSet = group_by(trainingSet, geo)
models = do(groupedTrainingSet, mod = lm(revenue ~ julian, data = .))

groupedTestSet = group_by(testSet, geo)
# TODO: apply model back to test set

Where models looks like

 geo     mod
1   APAC <S3:lm>
2  LATAM <S3:lm>
3     ME <S3:lm>
7    ROW <S3:lm>
4     WE <S3:lm>
5     NA <S3:lm>

I think I should be able to just apply `do` again, but I'm not seeing how. Alternatively, I could do something along the lines of

apply(trainingData, fitted =
    predict(select(models, geo==geo)$mod, .));

But I'm not sure about the syntax there.

score 9 · Accepted Answer · answered Jun 24 '14 at 00:33

Here is a dplyr method of obtaining a similar answer, following the approach used by @Mike.Gahan:

library(dplyr) 

iris.models <- iris %>%
  group_by(Species) %>%
  do(mod = lm(Sepal.Length ~ Sepal.Width, data = .))

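# Attach each species' model to its rows, then predict one row at a time
iris %>%
  tbl_df %>%                               # optional: only changes printing
  left_join(iris.models, by = "Species") %>%
  rowwise %>%
  mutate(Sepal.Length_pred = predict(mod,
                                    newdata = list("Sepal.Width" = Sepal.Width)))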

Alternatively, you can do it in one step if you create a function that fits the model and predicts:

m <- function(df) {
  mod <- lm(Sepal.Length ~ Sepal.Width, data = df)
  pred <- predict(mod,newdata = df["Sepal.Width"])
  data.frame(df,pred)
}

iris %>%
  group_by(Species) %>%
  do(m(.))
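
Both snippets above predict on the same rows the models were fitted on. For the train/test case in the question, the join-based approach could be adapted roughly as below. This is only a sketch: the split of `iris` into `train` and `test` (first 40 rows of each species versus the last 10) and the names `train`, `test`, `models` and `test_pred` are made up for illustration, and `data.frame()` is used for `newdata` instead of a list.

```r
library(dplyr)

# Hypothetical split: iris has 50 consecutive rows per species,
# so keep the first 40 of each for training and the last 10 for testing.
is_train <- rep(c(rep(TRUE, 40), rep(FALSE, 10)), times = 3)
train <- iris[is_train, ]
test  <- iris[!is_train, ]

# One model per species, fitted on the training rows only
models <- train %>%
  group_by(Species) %>%
  do(mod = lm(Sepal.Length ~ Sepal.Width, data = .))

# Attach each species' model to the test rows and predict row by row
test_pred <- test %>%
  left_join(models, by = "Species") %>%
  rowwise() %>%
  mutate(Sepal.Length_pred = predict(mod,
                                     newdata = data.frame(Sepal.Width = Sepal.Width))) %>%
  ungroup() %>%
  select(-mod)   # drop the model column once predictions are computed
```

With the question's data, `train`, `test`, `Species` and the formula would be replaced by `trainingSet`, `testSet`, `geo` and `revenue ~ julian`.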

answered Jun 24 '14 at 00:33

AndrewMacDonald


What's the point of the `tbl_df` command? I have looked at the documentation, but don't see how it applies. – gregmacfarlane Jul 16 '14 at 21:04
Doesn't make much difference in this case; it has become habit for me when using `dplyr`, because of its more convenient printing method. If that line is omitted, everything should work in the same way. – AndrewMacDonald Jul 17 '14 at 18:19
I was about to ask for extension to your answer, but decided it should be [its own question](http://stackoverflow.com/questions/24873550/apply-grouped-model-group-wise) – gregmacfarlane Jul 21 '14 at 19:54
You have to be careful with the first approach, because you add a `lm` object to each row of your data frame. With the `iris` data, the resulting data frame has an `object.size` of 3731096 bytes. If you pipe `select(-mod)` after the last line, the resulting data frame only has 8576 bytes. – Jon Snow Mar 23 '15 at 12:29
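
The size concern in the last comment can be checked with something like the following. It is a minimal sketch that reuses the join-based approach from the answer; the names `with_models` and `slim` are invented here, and the exact byte counts will depend on the R and dplyr versions, so none are quoted.

```r
library(dplyr)

iris.models <- iris %>%
  group_by(Species) %>%
  do(mod = lm(Sepal.Length ~ Sepal.Width, data = .))

# Predictions with the lm object still attached to every row
with_models <- iris %>%
  left_join(iris.models, by = "Species") %>%
  rowwise() %>%
  mutate(Sepal.Length_pred = predict(mod,
                                     newdata = data.frame(Sepal.Width = Sepal.Width))) %>%
  ungroup()

# Same predictions with the model column dropped
slim <- select(with_models, -mod)

# Compare the reported sizes of the two results
object.size(with_models)
object.size(slim)
```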

score 4 · Answer 2 · edited Jun 23 '14 at 20:16

I'm not sure there is a question here, but I think the data.table package is especially efficient for this.

# Load data.table package
library(data.table)
iris <- data.table(iris)

#Make a model for each species group
iris.models <- iris[, list(Model = list(lm(Sepal.Length ~ Sepal.Width))),
                      keyby = Species]

#Make predictions on dataset
setkey(iris, Species)
iris[iris.models, prediction := predict(i.Model[[1]], .SD), by = .EACHI]

(for data.table version <= 1.9.2 omit the by = .EACHI part)

edited Jun 23 '14 at 20:16

eddi


answered Jun 22 '14 at 23:35

Mike.Gahan


Note the issues raised in http://stackoverflow.com/questions/15096811/why-is-using-update-on-a-lm-inside-a-grouped-data-table-losing-its-model-data/15376891#15376891 when using `lm` and `.SD`. – mnel Jun 24 '14 at 00:46
Note that the same issue pointed out in the link above also happens with `dplyr`, for the same reason mentioned there. – Arun Jun 24 '14 at 13:44
