Computing lags but grouping by two categories with dplyr

Question

What I want it's create the var3 using a lag (dplyr package), but should be consistent with the year and the ID. I mean, the lag should belong to the corresponding ID. The dataset is like an unbalanced panel.

YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...

My issue is similar to the following question/post, but grouping by two categories:

dplyr: lead() and lag() wrong when used with group_by()

I tried to extend the solution, unsuccessfully (I get NAs).

Attempt #1:

data %>%
  group_by(YEAR,ID) %>%
  summarise(var1 = ...
            var2 = ...
            var3 = var1 - dplyr::lag(var2))
  )

Attempt #2:

data %>%
  group_by(YEAR,ID) %>%
  summarise(var1 = ...
            var2 = ...
            gr = sprintf(YEAR,ID)
            var3 = var1 - dplyr::lag(var2, order_by = gr))
  )

Minimum example:


MyData <- 
data.frame(YEAR = rep(seq(2010,2014),5),
           ID = rep(1:5, each=5),
           var1 = rnorm(n=25,mean=10,sd=3),
           var2 = rnorm(n=25,mean=1,sd=1)
           )

MyData %>%
  group_by(YEAR,ID) %>%
  summarise(var3 = var1 - dplyr::lag(var2)
  )

Thanks in advance.

What was unsuccessful about your attempts? The first looks as I would expect. It might help to have some data we can play with, along with your expected output of that sample data. Thanks — r2evans, Jun 19 '20 at 20:38
@r2evans I'm getting NAs. I updated the post to add an example. Thanks. — Ariel, Jun 19 '20 at 20:51
If you are doing a lag or lead operation, you will *always* get `NA` values unless you assign `lag(..., default=)` values. Think about this: what do you *expect* from `dplyr::lag(1:3)` ... the 2nd and 3rd values are `1:2`, but what is the 1st value? — r2evans, Jun 19 '20 at 21:04
And specific to this ... your `group_by` effectively turns this into a row-by-row (or `rowwise()`) operation. When you have a single row of `YEAR=2010,ID=1`, what do you expect `lag` to produce? — r2evans, Jun 19 '20 at 21:05
I dont think you want to group by both `ID` and `Year`? Is this what you want? `MyData %>% group_by(ID) %>% mutate(var3 = var1 - dplyr::lag(var2))` — Magnus Nordmo, Jun 19 '20 at 21:06
I think I'll return to my first comment, but more important now since it seems to me that you are either giving too brief a data sample or do not understand what should be happening with `group_by`: please provide expected output given your `MyData`. — r2evans, Jun 19 '20 at 21:06

r2evans · Accepted Answer · 2020-06-19T21:15:03.037

Do you mean group_by(ID) and effectively "order by YEAR"?

MyData  %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  print(n=99)
# # A tibble: 25 x 5
# # Groups:   ID [5]
#     YEAR    ID  var1    var2  var3
#    <int> <int> <dbl>   <dbl> <dbl>
#  1  2010     1 11.1   1.16   NA   
#  2  2011     1 13.5  -0.550  12.4 
#  3  2012     1 10.2   2.11   10.7 
#  4  2013     1  8.57  1.43    6.46
#  5  2014     1 12.6   1.89   11.2 
#  6  2010     2  8.87  1.87   NA   
#  7  2011     2  5.30  1.70    3.43
#  8  2012     2  6.81  0.956   5.11
#  9  2013     2 13.3  -0.0296 12.4 
# 10  2014     2  9.98 -1.27   10.0 
# 11  2010     3  8.62  0.258  NA   
# 12  2011     3 12.4   2.00   12.2 
# 13  2012     3 16.1   2.12   14.1 
# 14  2013     3  8.48  2.83    6.37
# 15  2014     3 10.6   0.190   7.80
# 16  2010     4 12.3   0.887  NA   
# 17  2011     4 10.9   1.07   10.0 
# 18  2012     4  7.99  1.09    6.92
# 19  2013     4 10.1   1.95    9.03
# 20  2014     4 11.1   1.82    9.17
# 21  2010     5 15.1   1.67   NA   
# 22  2011     5 10.4   0.492   8.76
# 23  2012     5 10.0   1.66    9.51
# 24  2013     5 10.6   0.567   8.91
# 25  2014     5  5.32 -0.881   4.76

(Disregarding your summarize into a mutate for now.)

Yes. In summary, change the function to mutate, and grouping only by ID. Thanks a lot. — Ariel, Jun 19 '20 at 21:13

Computing lags but grouping by two categories with dplyr

1 Answers1