multiple models: how to select best model and make prediction

Question

My task is to create many models, choose model that predict best and pass data to this model for prediction. Example inspired from R for data science book

library(modelr)
library(tidyverse)
library(gapminder)
gapminder

country_model1 <- function(df) {lm(lifeExp ~ year, data = df)}

country_model2 <- function(df) {lm(lifeExp ~ year+gdpPercap, data = df)}

country_model3 <- function(df) {lm(lifeExp ~ year+gdpPercap+pop, data = df)}

by_country <- gapminder %>% 
  group_by(country, continent) %>% 
  nest() %>% 
  mutate(model1 = map(data, country_model1),
         model2 = map(data, country_model2),
         model3 = map(data, country_model3))

So I have 3 models for each country. I can find r squared for each model, but stopped here :(

r_sq <- by_country %>% 
  mutate(glance1 = map(model1, broom::glance),
         glance2 = map(model2, broom::glance),
         glance3 = map(model3, broom::glance)) %>% 
  unnest(glance1:glance3, .drop = TRUE) %>% 
  select(country, continent, starts_with('r.sq'))

How to in tidy way:

select which of 3 make better prediction for each particular country?
pass new data to chosen model and have prediction back?

Jon Spring · Accepted Answer · 2021-03-04T07:23:56.803

We can identify the model with the highest r^2 for for each country like this:

best_fits <- r_sq %>%
  pivot_longer(-c(country, continent), names_to = "r_sq_version") %>%
  group_by(country, continent) %>%
  slice_max(value) %>%
  ungroup()

Not too surprisingly, the third model (called here r.squared2 from its name in r_sq) consistently provides the highest correlation, since that model takes more inputs and has more degrees of freedom.

Let's make some new data, taking the original but adding 100 years to the dates.

by_country_new <- gapminder %>%
  group_by(country, continent) %>%
  mutate(year = year + 100,
         gdpPercap = gdpPercap,
         pop = pop) %>%
  select(-lifeExp) %>%    # Presumably we don't know this and are trying to predict using known data
  nest()

We could then apply the best model for each country to the new data: (Thanks to @mrflick for https://stackoverflow.com/a/63201855/6851825)

  best_fits %>%
    left_join(by_country) %>%
    left_join(by_country_new, by = c("country", "continent")) %>%
    mutate(best_model = case_when(
      r_sq_version == "r.squared2" ~ model3,
      r_sq_version == "r.squared1" ~ model2,
      r_sq_version == "r.squared"  ~ model1,
    )) %>%
    select(-c(model1:model3)) %>%
    mutate(prediction = map2(best_model, data.y,
                             ~broom::augment(.x, newdata = .y))) -> new_fits

We can then see how these predictions look like a continuation of the time trend established in the original data (with some other variation due to changes in population and gdp in our new data).

new_predictions <-   new_fits %>%
    filter(country == "Afghanistan") %>%
    select(prediction) %>%
    unnest_wider(prediction) %>%
    flatten_dfr() %>%
  rename(lifeExp = ".fitted")

gapminder %>%
  filter(country == "Afghanistan") %>%
  bind_rows(new_predictions) %>%
  ggplot(aes(year, lifeExp)) +
  geom_point() +
  labs(title = "Afghanistan extrapolated lifeExp")

Thank you, Jon! It's exactly what I tried to find. – Yuriy Barvinchenko Mar 04 '21 at 11:25 — Yuriy Barvinchenko, Mar 04 '21 at 11:25

multiple models: how to select best model and make prediction

1 Answers1