dplyr custom lag function for irregular time series

Question

I have an irregular time series, where there are gaps in the dataset. Further, the data is grouped. The lag functions I have been able to find lag by observation (so they find the prior record in the dataset), but I want to specify a time variable and have the lag calculated by matching the lagged time. This question: R lag/lead irregular time series data is doing a similar thing. However, I can't use zoo solution (I have some sort of package incompatibility and can't use zoo at all) and have been unsuccessful in making the data.table solution into something sufficiently flexible to use as a function with lag amount as an input and the capacity for grouped data.

Test data:

testdf <- data.frame(group = c(1,1,1,1,1,2,2,2,2,2),
                 counter = c(1,2,3,5,6,7,8,9,11,12),
                 xval = seq(100, 1000, 100))
lagamount <- 1

The output should be the vector: NA 100 200 NA 400 NA 600 700 NA 900

This is what I am using at the moment:

library(dplyr)
testout <- group_by(testdf, group) %>%
  mutate(testout = function(x) which((testdf$counter - x) == lagamount))

This gives me a datatype error that something (unspecified) is not a vector.

Is there a way to make this construction work? Alternatively, how could I lag with irregular time series with grouped variables?

Perhaps you can call the relevant `zoo` function as `zoo::needed_function()` and avoid loading the package. — Axeman, Jul 07 '16 at 15:10
Also you're getting that error because you are supplying a function to `mutate`, and it expects a vector. — Axeman, Jul 07 '16 at 15:13
thanks I should have been clearer - zoo won't install on my machine, there is a problem with the Rccp dependency. Thanks for the error message interpretation, i will have another attempt — JenB, Jul 07 '16 at 15:14

Axeman · Accepted Answer · 2016-07-07T16:07:52.500

4

The only way to do this within dplyr, whithout resorting to using do, would be to first make implicit missing values explicit, and filter them out afterwards.

Supply a vector to mutate, and use ifelse (or perhaps the new dplyr::if_else) to check whether the lag is what you want it to be. Example:

library(tidyr)
lagamount <- 2

testout <- group_by(testdf, group) %>%
  complete(group, counter = min(counter):max(counter)) %>% 
  mutate(testout = if_else(counter - lag(counter, lagamount) == lagamount, 
                           lag(xval, lagamount), 
                           NA_real_)) %>% 
  filter(!is.na(xval))

Produces:

Source: local data frame [10 x 4]
Groups: group [2]

   group counter  xval testout
   <dbl>   <dbl> <dbl>   <dbl>
1      1       1   100      NA
2      1       2   200      NA
3      1       3   300     100
4      1       5   400     300
5      1       6   500      NA
6      2       7   600      NA
7      2       8   700      NA
8      2       9   800     600
9      2      11   900     800
10     2      12  1000      NA

edited Jul 07 '16 at 16:07

answered Jul 07 '16 at 15:18

Axeman

32,068
8
81
94

thank you. unfortunately, that only works with a single time unit lag. It can't, for example, retrieve the value that is 2 or 3 time units previous (unless that coincidentally is the previous observation). This is why I was trying to work with `which`. – JenB Jul 07 '16 at 15:37
Ok, the edit should solve that, I think. Otherwise, could you give the expected output for `lagamount = 2`? – Axeman Jul 07 '16 at 15:41
1

sorry, about the delay, was out of internet range. Output definitely looks like what I want. Yes, filingl in the missing data points and then deleting them at the end is a good approach. Will work my way through and accept in a couple of minutes. thank you, I would never have been able to produce this myself. – JenB Jul 07 '16 at 18:51
Glad I could help. Sorry it took me a few iterations to understand what was going wrong. – Axeman Jul 07 '16 at 21:35
Isn't that code redundant? If you `complete()` the dataset, you will always end up with `counter - dplyr::lag(counter, lagamount) == lagamount`, no (at least after sorting by counter)? – Matifou Aug 30 '18 at 18:50

score 1 · Answer 2 · answered Jul 10 '21 at 17:33

An efficient solution to this is now available in collapse::flag (and also fdiff and fgrowth). When creating the data you need to make sure the time variable is integer, otherwise it will internally be converted to factor which in this case removes the irregularity.

testdf <- data.frame(group = c(1,1,1,1,1,2,2,2,2,2),
                     counter = as.integer(c(1,2,3,5,6,7,8,9,11,12)),
                     xval = seq(100, 1000, 100))
lagamount <- 1

Then we can write:

library(collapse)
settransform(testdf, L_xval = flag(xval, lagamount, group, counter))
testdf
#>    group counter xval L_xval
#> 1      1       1  100     NA
#> 2      1       2  200    100
#> 3      1       3  300    200
#> 4      1       5  400     NA
#> 5      1       6  500    400
#> 6      2       7  600     NA
#> 7      2       8  700    600
#> 8      2       9  800    700
#> 9      2      11  900     NA
#> 10     2      12 1000    900

^{Created on 2021-07-10 by the reprex package (v0.3.0)}

You also have the option to generate a sequence of lags and leads (or specific lag/lead orders), either using pipes:

testdf |> gby(group) |> flag(-1:3, counter)
#>    group counter F1.xval xval L1.xval L2.xval L3.xval
#> 1      1       1     200  100      NA      NA      NA
#> 2      1       2     300  200     100      NA      NA
#> 3      1       3      NA  300     200     100      NA
#> 4      1       5     500  400      NA     300     200
#> 5      1       6      NA  500     400      NA     300
#> 6      2       7     700  600      NA      NA      NA
#> 7      2       8     800  700     600      NA      NA
#> 8      2       9      NA  800     700     600      NA
#> 9      2      11    1000  900      NA     800     700
#> 10     2      12      NA 1000     900      NA     800
#> 
#> Grouped by:  group  [2 | 5 (0)]

^{Created on 2021-07-10 by the reprex package (v0.3.0)}

Or doing in-place modification with settransformv:

settransformv(testdf, "xval", flag, -1:3, group, counter, apply = FALSE)
testdf
#>    group counter xval F1.xval L1.xval L2.xval L3.xval
#> 1      1       1  100     200      NA      NA      NA
#> 2      1       2  200     300     100      NA      NA
#> 3      1       3  300      NA     200     100      NA
#> 4      1       5  400     500      NA     300     200
#> 5      1       6  500      NA     400      NA     300
#> 6      2       7  600     700      NA      NA      NA
#> 7      2       8  700     800     600      NA      NA
#> 8      2       9  800      NA     700     600      NA
#> 9      2      11  900    1000      NA     800     700
#> 10     2      12 1000      NA     900      NA     800

^{Created on 2021-07-10 by the reprex package (v0.3.0)}

fdiff and fgrowth work similarly and also support iterations and compunding. You can also apply these functions to irregular time series (without a panel-id), then you need to specify t = counter. All functions can be applied to vectors / time series, matrices / xts, data frames / data tables / tibbles, and also support plm panel-series and data frames if you are looking for an object oriented approach.

More information at: https://sebkrantz.github.io/collapse/reference/time-series-panel-series.html

just a quick question, why do you use apply = FALSE here? apply = TRUE,should give the same solution? Can you elaborate a bit what exactly is the difference between apply = TRUE and apply = FALSE — Vitalijs, Jul 28 '22 at 08:50
@Vitalijs the difference is that `apply = TRUE` applies the function to the whole subset of the frame. If we have more than 1 variable to lag this is more efficient, because using `flag.data.frame` means we only need to group/index the data once and can apply the lagging across columns in C++, instead of using `lapply` with `flag.default` and re-indexing each time. — Sebastian, Aug 16 '22 at 23:18

JenB · Answer 3 · 2016-07-18T11:45:18.200

I ended up having to make the expansion explicit and removed the strict datatyping in the if_else when I turned the above answer into a function. This is the final form.

getlag <- function(timevar, valuevar, laglength){
  df1 <- data.frame(counter = timevar, value = valuevar, indf = 1)
  alltimes <- data.frame(counter = seq(min(timevar), max(timevar)))
  df2 <- merge(alltimes, df1, all.x = TRUE)
  df2 <- df2 %>%
    mutate(lagvals = ifelse(counter - lag(counter, laglength) == laglength,
                            lag(value, laglength),
                            NA_real_)) %>%
    filter(!is.na(indf))
  return(df2$lagvals)
  }

And the test use case is:

testout <- group_by(testdf, group) %>%
  mutate(testout = getlag(counter, xval, 1))

dplyr custom lag function for irregular time series

3 Answers3

Linked