Using group_by() from dplyr with predict.lm and do() in a pipe for year linear extrapolation

Question

I would like to apply a linear extrapolation for one year within a pipe. What I would like to do is very similar to this simple example without grouping. But within a pipe and employing dplyr::group_by(). There are some examples like this one, this one or this one. But I can't manage to get the desirable output.

Reproducible example:

test.frame <- data.frame(Country = 
rep(c("Austria", "Brazil", "Canada"), each = 3, times = 3), 
  Entity = rep(c("CO2","CH4","N2O"), times = 9),
  Year = rep(c(1990:1992), each = 9),
  value = runif(27, 1,5))

test.frame2 <- data.frame(Country = 
rep(c("Austria", "Brazil", "Canada"), each = 3), 
    Entity =  rep(c("CO2","CH4","N2O"), times = 3),
    Year = rep(c(1993), each = 3),
    value = 0)

results_frame <- test.frame %>% 
  dplyr::bind_rows(test.frame2)

I have two grouping categories (Country and Entity), and I would like to use the values for years 1990 to 1992 to fill those for year 1993 using linear extrapolation. On the basis of this, I can estimate the linear model:

linear_model <- test.frame %>%  
dplyr::group_by(Country, Entity) %>% 
lm(value ~ Year, data=.)

results <- predict.lm(linear_model, test.frame2)

But then, results doesn't show the desirable output. So following the solution proposed here I try the following:

results_frame <- test.frame %>%
  dplyr::group_by(Country, Entity) %>% 
  do(lm( value ~ Year , data = test.frame)) %>%
  predict.lm(linear_model, test.frame2) %>% 
  bind_rows(test.frame)

But it doesn't work, instead I get

Error: Results 1, 2, 3, 4, 5, ... must be data frames, not lm

Any help would be highly appreciated!

score 4 · Accepted Answer · edited Oct 21 '21 at 13:43

You have to be careful to use the right data when fitting and predicting:

library(dplyr)
set.seed(42)
test.frame <- tibble(Country = rep(c("Austria", "Brazil", "Canada"), each = 3, times = 3), 
                         Entity = rep(c("CO2","CH4","N2O"), times = 9),
                         Year = rep(c(1990:1992), each = 9),
                         value = runif(27, 1,5))

test.frame %>%
  group_by(Country, Entity) %>% 
  do(lm( value ~ Year , data = .) %>% 
       predict(., tibble(Year = 1993)) %>%
       tibble(Year = 1993, value = .)) %>%
  bind_rows(test.frame)
#> # A tibble: 36 x 4
#> # Groups:   Country, Entity [9]
#>    Country Entity  Year value
#>    <fct>   <fct>  <dbl> <dbl>
#>  1 Austria CH4     1993 2.10 
#>  2 Austria CO2     1993 2.03 
#>  3 Austria N2O     1993 6.02 
#>  4 Brazil  CH4     1993 4.90 
#>  5 Brazil  CO2     1993 0.771
#>  6 Brazil  N2O     1993 5.28 
#>  7 Canada  CH4     1993 4.69 
#>  8 Canada  CO2     1993 0.729
#>  9 Canada  N2O     1993 1.49 
#> 10 Austria CO2     1990 4.66 
#> # ... with 26 more rows

acylam · Answer 2 · 2018-08-03T18:57:06.797

You can do something like the following using nested data.frames. This solution is more general in that one do not need to re-create test.frame2 after the prediction and there could be more than one independent variables:

library(tidyverse)
test.frame %>%
  group_by(Country, Entity) %>%
  nest() %>%
  inner_join(test.frame2 %>% select(-value) %>% group_by(Country, Entity) %>% nest(),
             by = c("Country", "Entity")) %>%
  mutate(model = data.x %>% map(~lm(value ~ Year, data=.)),
         value = map2(model, data.y, predict)) %>%
  select(-data.x, -model) %>%
  unnest() %>%
  bind_rows(test.frame, .)

Result:

   Country Entity Year      value
1  Austria    CO2 1990  3.6245955
2  Austria    CH4 1990  3.3857752
3  Austria    N2O 1990  1.4798741
4   Brazil    CO2 1990  2.5865668
5   Brazil    CH4 1990  1.3271481
6   Brazil    N2O 1990  4.4537926
7   Canada    CO2 1990  4.7295768
8   Canada    CH4 1990  4.5255033
9   Canada    N2O 1990  2.3129381
10 Austria    CO2 1991  4.8810838
11 Austria    CH4 1991  4.9950455
12 Austria    N2O 1991  2.1288504
13  Brazil    CO2 1991  4.7767443
14  Brazil    CH4 1991  2.0315449
15  Brazil    N2O 1991  1.9307966
16  Canada    CO2 1991  4.6831029
17  Canada    CH4 1991  2.2761538
18  Canada    N2O 1991  3.0856428
19 Austria    CO2 1992  3.1223000
20 Austria    CH4 1992  4.7715588
21 Austria    N2O 1992  1.5733608
22  Brazil    CO2 1992  2.9463442
23  Brazil    CH4 1992  1.9569259
24  Brazil    N2O 1992  1.4428006
25  Canada    CO2 1992  3.0750847
26  Canada    CH4 1992  1.4635521
27  Canada    N2O 1992  2.8061861
28 Austria    CO2 1993  3.3736976
29 Austria    CH4 1993  5.7699101
30 Austria    N2O 1993  1.8208485
31  Brazil    CO2 1993  3.7963291
32  Brazil    CH4 1993  2.4016508
33  Brazil    N2O 1993 -0.4018621
34  Canada    CO2 1993  2.5080960
35  Canada    CH4 1993 -0.3068815
36  Canada    N2O 1993  3.2281704

Using group_by() from dplyr with predict.lm and do() in a pipe for year linear extrapolation

2 Answers2