Given a data.frame of grouped data:
library(tidyverse)
# fake up some grouped data:
set.seed(123)
dat <- data.frame(x = rnorm(100),
                  y = rnorm(100),
                  group = rep(x = letters[1:10], each = 10))
head(dat)
            x           y group
1 -0.56047565 -0.71040656     a
2 -0.23017749  0.25688371     a
3  1.55870831 -0.24669188     a
4  0.07050839 -0.34754260     a
5  0.12928774 -0.95161857     a
6  1.71506499 -0.04502772     a
I want to build a set of independent models by one (or more) grouping columns:
# store models by group in a list
models <- list()
for (i in letters[1:10]) {
  models[[paste0("mdl_", i)]] <- lm(y ~ x, dat %>% filter(group == i))
}
names(models)
[1] "mdl_a" "mdl_b" "mdl_c" "mdl_d" "mdl_e" "mdl_f" "mdl_g" "mdl_h" "mdl_i" "mdl_j"
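For reference, the same named list can be built without an explicit loop. This is only a minimal base-R sketch; the name `models_alt` is hypothetical and is used here so it does not overwrite `models`.

```r
# Rebuild the example data so the sketch is self-contained
set.seed(123)
dat <- data.frame(x = rnorm(100),
                  y = rnorm(100),
                  group = rep(letters[1:10], each = 10))

# split() returns a named list of data frames, one per group;
# lapply() then fits an independent lm to each piece
models_alt <- lapply(split(dat, dat$group), function(d) lm(y ~ x, data = d))

# Use the same "mdl_<group>" naming as the loop version
names(models_alt) <- paste0("mdl_", names(models_alt))
names(models_alt)
```

Splitting on more than one grouping column should work the same way by passing a list of columns to `split()`, though the resulting names would then need adjusting.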
I can add the model predictions (fitted values) to the original data frame in a number of ways; this one is convenient:
# add model predictions (fitted values) column to original data frame
dat <- dat %>%
  group_by(group) %>%
  mutate(fits = lm(y ~ x)$fitted.values)
# verify prediction from stored models and fitted values column match
# to within a 10-decimal tolerance:
for (i in letters[1:10]) {
  tmp <- dat %>%
    filter(group == i) %>%
    select(group, x, y, fits)
  tmp$stored_fit <- predict(models[[paste0("mdl_", i)]], tmp)
  print(paste("mdl", i, "results match:", all(round(tmp$stored_fit, 10) == round(tmp$fits, 10))))
}
[1] "mdl a results match: TRUE"
[1] "mdl b results match: TRUE"
[1] "mdl c results match: TRUE"
[1] "mdl d results match: TRUE"
[1] "mdl e results match: TRUE"
[1] "mdl f results match: TRUE"
[1] "mdl g results match: TRUE"
[1] "mdl h results match: TRUE"
[1] "mdl i results match: TRUE"
[1] "mdl j results match: TRUE"
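As an aside, the same check could be written with `all.equal()` and an explicit tolerance instead of rounding both sides. This is a sketch under the assumption that a relative tolerance of `1e-10` is an acceptable stand-in for the 10-decimal comparison; `chk` is a hypothetical name.

```r
# Rebuild the example data so the sketch is self-contained
set.seed(123)
dat <- data.frame(x = rnorm(100),
                  y = rnorm(100),
                  group = rep(letters[1:10], each = 10))

# For each group: fit the model, then compare its fitted values with
# predict() on the same rows, using a numeric tolerance
chk <- sapply(split(dat, dat$group), function(d) {
  m <- lm(y ~ x, data = d)
  isTRUE(all.equal(unname(fitted(m)),
                   unname(predict(m, newdata = d)),
                   tolerance = 1e-10))
})
chk
```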
All of these steps have been discussed in other questions like this one.
Now I want to generate the predictions from these models on a new data.frame and add those predictions as a column to that data.frame.
Here are a couple of things I tried:
# fake up some new grouped data:
set.seed(456)
dat2 <- data.frame(x = rnorm(100),
                   y = rnorm(100),
                   group = rep(x = letters[1:10], each = 10))
Method 1 (apply):
tmp <- dat2 %>%
  group_by(group) %>%
  nest() # %>%
  # mutate(fits = map())
fits = as.data.frame(apply(X = tmp, MARGIN=1, FUN = function(X) predict(models[[paste0("mdl_",X$group)]], X$data)))
names(fits) = tmp$group
fits <- fits %>%
  pivot_longer(cols = everything(), names_to = "group.fits") %>%
  arrange(group.fits)
tmp <- tmp %>%
  unnest(cols = c(data)) %>%
  bind_cols(fits)
... which just feels error-prone and inelegant.
Method 2 (for loop, base R):
tmp$fits = NA
for (g in unique(tmp$group)) {
  tmp[tmp$group == g, ]$fits = predict(models[[paste0("mdl_", g)]], tmp[tmp$group == g, ])
}
tmp
There is nothing particularly wrong with this, other than that for loops have a reputation for being slow on larger datasets.
Method 3 (nest/map):
I thought something like the following would work, but I have something wrong in the syntax:
dat2 %>%
  group_by(group) %>%
  nest() %>%
  mutate(fits = map(.f = predict(models[[paste0("mdl_",group)]]), data))
or
  mutate(fits = map(.x = data,
                    .f = predict(models[[paste0("mdl_",group)]],
                                 .x)))
I'm looking for an answer somewhere along Method 3's route - ideally all within one set of dplyr commands.
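For what it's worth, one possible shape for Method 3 is sketched below. It is not necessarily the best approach, and it makes two assumptions: that `map2()` is used to walk over the group label and the nested data together, and that `.f` is given a function rather than the result of calling `predict()`. The name `res` is hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Rebuild the stored models so the sketch is self-contained
set.seed(123)
dat <- data.frame(x = rnorm(100), y = rnorm(100),
                  group = rep(letters[1:10], each = 10))
models <- lapply(split(dat, dat$group), function(d) lm(y ~ x, data = d))
names(models) <- paste0("mdl_", names(models))

# New data to predict on
set.seed(456)
dat2 <- data.frame(x = rnorm(100), y = rnorm(100),
                   group = rep(letters[1:10], each = 10))

res <- dat2 %>%
  nest(data = -group) %>%                      # one row per group, ungrouped
  mutate(fits = map2(group, data, function(g, d) {
    # look up the model for this group and predict on its nested rows
    unname(predict(models[[paste0("mdl_", g)]], newdata = d))
  })) %>%
  unnest(cols = c(data, fits))                 # back to one row per observation

head(res)
```

The key difference from my attempts is that `.f` here is a function of the group label and the nested data frame, so the model lookup happens once per group rather than once for the whole `group` column.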