
I'd like to create a variable containing the value of a variable in the previous year within a group.

     id   date        value
1     1   1992          4.1  
2     1     NA          4.5  
3     1   1991          3.3  
4     1   1990          5.3  
5     1   1994          3.0  
6     2   1992          3.2  
7     2   1991          5.2  

value_lagged should be missing when the previous year is missing within a group - either because it is the first date within a group (as in rows 4 and 7), or because there are year gaps in the data (as in row 5). value_lagged should also be missing when the current date itself is missing (as in row 2).

The desired result is:

     id   date    value    value_lagged  
1     1   1992      4.1             3.3
2     1     NA      4.5              NA
3     1   1991      3.3             5.3
4     1   1990      5.3              NA
5     1   1994      3.0              NA
6     2   1992      3.2             5.2
7     2   1991      5.2              NA

For now, in R, I use the data.table package:

 DT = data.table(id    = c(1,1,1,1,1,2,2),
                 date  = c(1992,NA,1991,1990,1994,1992,1991),
                 value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2)
                )
 setkey(DT, id, date)
 DT[, value_lagged := DT[J(id, date-1), value], ]
 DT[is.na(date), value_lagged := NA, ]

It's fast, but it seems somewhat error-prone to me. I'd like to know whether there are better alternatives using data.table, dplyr, or any other package. Thanks a lot!


In Stata, one would do:

    tsset id date
    gen value_lagged=L.value
Matthew
  • unless you specifically want rows with a missing **value** to not have a matching lagged value, you likely meant to use `is.na(date)` not `is.na(value)` – Ricardo Saporta Sep 03 '14 at 16:28
  • @Matthew seems like you already have a decent solution - what exactly do you want to improve about it? – eddi Sep 03 '14 at 17:59
  • My solution seems somewhat convoluted and error-prone to me - but maybe that's because I'm new to R. And even if it is a decent solution, there might be even simpler ways to do it! In any case, it's an important question (at least for Stata users), so I thought there should be a definitive question/answer on Stack Overflow about it. – Matthew Sep 03 '14 at 18:06
  • Here's a slight rewrite of your final two steps, that's perhaps slightly less error prone (because you're not repeating `DT` multiple times), but I don't really see what else you want besides having a function that's specifically designed to do what you want (which what your stata solution really is): `DT[J(id, date + 1, val = value), val_lag := i.val][is.na(date), val_lag := NA]` – eddi Sep 03 '14 at 18:39
  • Thanks! Wait, why doesn't it give rows with date==1995? – Matthew Sep 04 '14 at 00:33
  • This is so weird. So `D1=DT[J(id, date + 1, val = value)]`, `D1[, val_lag:=val]` does not give the same result as `DT[J(id, date + 1, val = value)][, val_lag:=val]`? – Matthew Sep 04 '14 at 00:56
  • Well, I understand now - even though it's still a bit puzzling. Would you know whether one of the two syntaxes, `DT[J(id, date + 1, val = value), val_lag := i.val]` or `DT[,value_lagged :=DT[list(id,date-1),value],]`, is preferable for what data.table does? Or are they completely equivalent? – Matthew Sep 04 '14 at 01:57
  • @Matthew they are not equivalent: `DT[J(...), a := b]` does the assignments for the matching portion of `DT` from the merge, whereas `DT[J(...)]` may be larger or smaller than `DT` itself. – eddi Sep 04 '14 at 15:10
  • the `DT[J(...), a := b]` syntax is better in that it avoids using the `data.table` name twice, which can be a source of errors; it's also going to be a tiny bit faster, but that's probably irrelevant for any task – eddi Sep 04 '14 at 15:12
  • You should post yours as the answer (for data.table), as that's the way to go about it. There's no copy being made there at all, which is also a requirement from your Q. – Arun Sep 20 '14 at 13:36
  • @eddi actually I've recently realized your answer does not work in case there are no missing dates and no corresponding observation within each group (for instance -8 in my example) – Matthew Sep 20 '14 at 15:59
  • @Arun wrote another method as an answer! – Matthew Sep 30 '14 at 04:49

4 Answers


I'd probably tackle this using a join:

library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2), 
  date = c(1992, NA, 1991, 1990, 1994, 1992, 1991), 
  value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2)
)


last_year <- df %>% 
  filter(!is.na(date)) %>%
  mutate(date = date + 1, lagged_value = value, value = NULL)

df %>%
  left_join(last_year)
#> Joining by: c("id", "date")
#>   id date value lagged_value
#> 1  1 1992   4.1          3.3
#> 2  1   NA   4.5           NA
#> 3  1 1991   3.3          5.3
#> 4  1 1990   5.3           NA
#> 5  1 1994   3.0           NA
#> 6  2 1992   3.2          5.2
#> 7  2 1991   5.2           NA
hadley

Using data.table v1.9.5, where joins no longer need keys to be set, this can be done as follows:

require(data.table) # v1.9.5+
DT[!is.na(date), value_lagged := 
         .SD[.(id = id, date = date - 1), value, on = c("id", "date")]]
#    id date value value_lagged
# 1:  1 1992   4.1          3.3
# 2:  1   NA   4.5           NA
# 3:  1 1991   3.3          5.3
# 4:  1 1990   5.3           NA
# 5:  1 1994   3.0           NA
# 6:  2 1992   3.2          5.2
# 7:  2 1991   5.2           NA

It's a variation of your idea. The trick is to use `is.na()` directly in `i` and to use `.SD` in `j` instead of `DT`. I've used the `on=` syntax, but the same idea can of course be implemented by setting keys as well.
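For completeness, here is a minimal sketch of the keyed variant just mentioned (an assumption of how it might look, combining the question's keyed join with the NA filter in one step; note that it repeats the `DT` name):

```r
library(data.table)

DT <- data.table(id    = c(1, 1, 1, 1, 1, 2, 2),
                 date  = c(1992, NA, 1991, 1990, 1994, 1992, 1991),
                 value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2))
setkey(DT, id, date)

# For each non-NA-date row, look up value at (id, date - 1) via the key;
# rows with a missing date are never assigned and so stay NA.
DT[!is.na(date), value_lagged := DT[.(id, date - 1), value]]
```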

Arun
  • Hi! With data.table 1.9.5, installed 11-sep-2015, this gives the error "Error in `[.data.table`(.SD, .(id = id, date = date - 1), value, on = c("id", : unused argument (on = c("id", "date"))" – JBJ Sep 11 '15 at 09:05
  • @JBJ Use `remove.packages()` to uninstall, reinstall and try again. – Arun Sep 15 '15 at 09:19

Create a function `tlag`, which lags a vector given a vector of times, and use it within groups defined by `id`:

library(dplyr)
tlag <- function(x, n = 1L, time) { 
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
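To see how the `match()` trick behaves on the awkward cases (a missing date, the first year of a group, and a year gap), here is the same function applied by hand to the `id == 1` rows of the example data, with no dplyr needed:

```r
tlag <- function(x, n = 1L, time) {
  # match(time - n, time) finds, for each row, the position of the previous
  # period; incomparables = NA stops an NA time from matching another NA
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

tlag(c(4.1, 4.5, 3.3, 5.3, 3.0), time = c(1992, NA, 1991, 1990, 1994))
#> [1] 3.3  NA 5.3  NA  NA
```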
Matthew
  • `N=2e6L` is quite small. 0.1 vs 0.23 seconds isn't that impressive.. Could you please try on `2e7L` or even `2e8L`? – Arun Oct 01 '14 at 14:25
  • It's not impressive indeed. The point was really to have a more readable solution, and it turns out that there is no performance cost at 2e6. I've just rerun the test: the lag solution is still *slightly* faster at 1e7, but pure data.table is twice as fast at 1e8 – Matthew Oct 01 '14 at 14:58
  • 1
    The problem with *readability* is that it is not a proper measure; varies from one to another. I very much like the pure data.table solution, for example :). – Arun Oct 01 '14 at 15:01
  • Readability varies from one person to the next, but that does not mean it's not a proper measure, right? Don't you agree that, for a colleague who browses my code, the lag solution is clearer to read? What about confusing `+1` and `-1`? – Matthew Oct 01 '14 at 15:23
  • Would you post data.table's solution as an answer? I'd suppress it from my question, so that people would see three different answers for creating lagged variables – Matthew Oct 01 '14 at 16:08
  • I don't think that's necessary. You've shown the solution as your attempt (which you felt could be improved - hence the question) and have posted a version which suits you better. And it's a great answer (I +1'd it already). – Arun Oct 01 '14 at 16:21
  • This is good and I upvoted it, but there's no reason to go defining a custom lag function with a dozen major packages already containing one and n-time-period lags. – Hack-R Oct 02 '14 at 14:25

An elegant and fast general solution to computations on irregular time series and unbalanced panels is now offered by the collapse package with the functions `flag`, `fdiff`, and `fgrowth`. See a general answer for lagging an unbalanced panel here.

Now, in your specific application, there is an additional, rare complication: the panel is not only unbalanced, but you also have a missing value in your time variable, which means you don't know the time period in which a record was observed. In such cases just applying `collapse::flag` won't do, but you can generate a new id variable that places the missing value in a separate group with `collapse::seqid`. So my solution would be:

library(collapse)  
DF = data.frame(id    = c(1,1,1,1,1,2,2),
                date  = c(1992,NA,1991,1990,1994,1992,1991),
                value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2))

settransform(DF, l_value = flag(value, 1, g = seqid(date, order(id, date)), t = date))
DF
#>   id date value l_value
#> 1  1 1992   4.1     3.3
#> 2  1   NA   4.5      NA
#> 3  1 1991   3.3     5.3
#> 4  1 1990   5.3      NA
#> 5  1 1994   3.0      NA
#> 6  2 1992   3.2     5.2
#> 7  2 1991   5.2      NA

Created on 2021-07-10 by the reprex package (v0.3.0)

I am quite confident that this is still faster than data.table, but I have not tested it. Again, this data is rather odd; in most cases, where the panel is unbalanced but the records are all identified by `id` and `time`, a simple `flag(value, 1, id, as.integer(date))` will do fine and be strikingly fast. Note that you get efficiency gains by ensuring that the time variable is integer, as `flag` will coerce non-integer time variables to factor, which could also get rid of the irregularity.
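For illustration, here is a small sketch of that simpler common case (toy data of my own, not from the question: the time variable has no NAs, but `id` 2 skips a year):

```r
library(collapse)

DF2 <- data.frame(id    = c(1L, 1L, 2L, 2L),
                  date  = c(1990L, 1991L, 1991L, 1993L),  # id 2 has a gap at 1992
                  value = c(5.3, 3.3, 5.2, 3.2))

# Integer time variable: flag respects the gap, so the 1993 row gets NA
# because there is no 1992 observation to lag from.
DF2$l_value <- flag(DF2$value, 1, g = DF2$id, t = DF2$date)
```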

Sebastian