2

My task is to create many models, choose model that predict best and pass data to this model for prediction. Example inspired from R for data science book

library(modelr)
library(tidyverse)
library(gapminder)
gapminder

country_model1 <- function(df) {lm(lifeExp ~ year, data = df)}

country_model2 <- function(df) {lm(lifeExp ~ year+gdpPercap, data = df)}

country_model3 <- function(df) {lm(lifeExp ~ year+gdpPercap+pop, data = df)}

by_country <- gapminder %>% 
  group_by(country, continent) %>% 
  nest() %>% 
  mutate(model1 = map(data, country_model1),
         model2 = map(data, country_model2),
         model3 = map(data, country_model3))

So I have 3 models for each country. I can find r squared for each model, but stopped here :(

r_sq <- by_country %>% 
  mutate(glance1 = map(model1, broom::glance),
         glance2 = map(model2, broom::glance),
         glance3 = map(model3, broom::glance)) %>% 
  unnest(glance1:glance3, .drop = TRUE) %>% 
  select(country, continent, starts_with('r.sq')) 

How to in tidy way:

  1. select which of 3 make better prediction for each particular country?
  2. pass new data to chosen model and have prediction back?
Yuriy Barvinchenko
  • 1,465
  • 1
  • 12
  • 17

1 Answers1

2

We can identify the model with the highest r^2 for for each country like this:

best_fits <- r_sq %>%
  pivot_longer(-c(country, continent), names_to = "r_sq_version") %>%
  group_by(country, continent) %>%
  slice_max(value) %>%
  ungroup() 

Not too surprisingly, the third model (called here r.squared2 from its name in r_sq) consistently provides the highest correlation, since that model takes more inputs and has more degrees of freedom.

Let's make some new data, taking the original but adding 100 years to the dates.

by_country_new <- gapminder %>%
  group_by(country, continent) %>%
  mutate(year = year + 100,
         gdpPercap = gdpPercap,
         pop = pop) %>%
  select(-lifeExp) %>%    # Presumably we don't know this and are trying to predict using known data
  nest()

We could then apply the best model for each country to the new data: (Thanks to @mrflick for https://stackoverflow.com/a/63201855/6851825)

  best_fits %>%
    left_join(by_country) %>%
    left_join(by_country_new, by = c("country", "continent")) %>%
    mutate(best_model = case_when(
      r_sq_version == "r.squared2" ~ model3,
      r_sq_version == "r.squared1" ~ model2,
      r_sq_version == "r.squared"  ~ model1,
    )) %>%
    select(-c(model1:model3)) %>%
    mutate(prediction = map2(best_model, data.y,
                             ~broom::augment(.x, newdata = .y))) -> new_fits

We can then see how these predictions look like a continuation of the time trend established in the original data (with some other variation due to changes in population and gdp in our new data).

new_predictions <-   new_fits %>%
    filter(country == "Afghanistan") %>%
    select(prediction) %>%
    unnest_wider(prediction) %>%
    flatten_dfr() %>%
  rename(lifeExp = ".fitted")

gapminder %>%
  filter(country == "Afghanistan") %>%
  bind_rows(new_predictions) %>%
  ggplot(aes(year, lifeExp)) +
  geom_point() +
  labs(title = "Afghanistan extrapolated lifeExp")

enter image description here

Jon Spring
  • 55,165
  • 4
  • 35
  • 53