I am trying to create a process that takes in a dataframe and creates additional lagged and rolling window features (e.g. moving average). This is what I have so far.

library(dplyr)
library(zoo)  # provides rollmean()

# dummy dataframe
n <- 20
set.seed(123)
foo <- data.frame(
  date = seq(as.Date('2020-01-01'),length.out = n, by = 'day'),
  var1 = sample.int(n),
  var2 = sample.int(n))

# creates lags and based on (some of) them creates rolling average features
foo %>% 
  mutate_at(vars(starts_with('var')),
            funs(lag_1 = lag(.), lag_2 = lag(.,2))) %>% 
  mutate_at(vars(contains('lag_1')),
            funs(ra_3 = rollmean(., k = 3, align = 'right', fill = NA)))

The above chunk:

  1. creates lag_1 and lag_2 features for the selected variables
  2. creates rolling average features based on a subset of the newly created columns

What I am now looking for is a way to create an arbitrary number of lagged features (e.g. lag3, lag6, lag9 and so on) as well as an arbitrary number of rolling average features of different window lengths (e.g. var1_lag_1_ra_3, var1_lag_1_ra_6, var2_lag_1_ra_3, var2_lag_1_ra_6). At the moment the settings that generate these features are hardcoded. Ideally I would have a couple of vectors to adjust the outcome, like so:

lag_features <- c(3,6,9)
ma_features <- c(12,15)

Lastly, it would be nice if there were a way to configure the names of the generated features dynamically. I've seen the {{}}, !! and := operators, but I can't really tell the difference between them or how to use them.

I have also implemented the above using some readily available functions from the timetk package, but since I am looking for some additional flexibility, I was wondering how I could replicate such behavior myself.

library(timetk)
foo %>% 
  select(date,starts_with('var')) %>%
  tk_augment_lags(.value = starts_with("var"),
                  .lags = 1) %>% 
  tk_augment_slidify(.value   = ends_with("lag1"),
                     .period  = seq(0,24,3)[-1],
                     .f       = mean,
                     .align   = 'right', 
                     .partial = TRUE
  )

Any support would be really appreciated.

takmers
  • Just a note: `mutate_at` is outdated. Use `mutate` combined with `across` instead. :-) – Martin Gal Oct 19 '21 at 21:35
  • You are absolutely right, I have tested that as well, still cannot overcome hardcoding `foo %>% select(contains('var')) %>% mutate(across(starts_with('var'), list(lag01 = lag, lag02 = ~lag(.,2))))` – takmers Oct 19 '21 at 22:02
  • To be honest: I read your question twice and I still don't understand what you are trying to do. – Martin Gal Oct 19 '21 at 22:07
  • Say for example that I need to create c(1,2,3,6,12,24) lags -across selected variables - and subsequently create rolling averages of c(3,6,12,24) window length - based on lag1 and across selected variables - w/o having to manually adjust the code every time.. that's why I need to somehow pivot away from the hardcoded values and towards a more flexible/dynamic way to create the above.. is it more clear now? – takmers Oct 19 '21 at 23:31

1 Answer

You can use map_dfc() from purrr to build the lagged columns for each value in a vector of lags, and again for each rolling window length. The .names argument of across() provides the names of the new columns.

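library(dplyr)
library(purrr)
library(zoo)

lag_features <- c(3, 6, 9)
ma_features <- c(12, 15)

# Step 1: add one lagged copy of every `var*` column per entry in lag_features.
# Step 2: add rolling means of the lag-3 columns. `d` is the output of step 1,
# so the lag columns already exist when contains('lag3') is evaluated.
foo <- foo %>%
  bind_cols(map_dfc(lag_features, ~foo %>%
                      transmute(across(starts_with('var'),
                                       lag, .x, .names = '{col}_lag{.x}')))) %>%
  (function(d) bind_cols(d, map_dfc(ma_features, ~d %>%
    # rollapplyr() instead of rollmeanr(): the lagged columns start with NAs,
    # which the default rollmean() method does not handle
    transmute(across(contains('lag3'), rollapplyr, width = .x, FUN = mean,
                     fill = NA, .names = '{col}_{.x}')))))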
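If you would rather not repeat this pattern by hand, one possible way to package it is sketched below. This is only an illustration: `add_lag_ra()` and its arguments (`cols`, `lags`, `windows`, `base_lag`) are hypothetical names made up here, not functions from dplyr, zoo or timetk. It assumes a dplyr version that supports `{.col}` and `{.fn}` in `.names`, and it uses `rollapplyr()` with `mean` as a stand-in for `rollmeanr()` so that the leading NAs in the lagged columns do not spill into every window. The `{{ cols }}` part simply forwards whatever selection you pass in (such as `starts_with("var")`) to `across()`.

```r
library(dplyr)
library(purrr)
library(zoo)

# Hypothetical helper (an assumption for illustration, not a package function):
# adds one lag column per entry in `lags` for the selected columns, then one
# right-aligned rolling mean per entry in `windows` on the `base_lag` columns.
add_lag_ra <- function(data, cols, lags, windows, base_lag = lags[1]) {
  # One function per lag; the list names ("lag3", ...) become {.fn} in .names
  lag_fns <- set_names(
    map(lags, function(n) {
      force(n)
      function(x) dplyr::lag(x, n)
    }),
    paste0("lag", lags)
  )
  # One rolling-mean function per window length, named "ra12", "ra15", ...
  ra_fns <- set_names(
    map(windows, function(k) {
      force(k)
      function(x) zoo::rollapplyr(x, width = k, FUN = mean, fill = NA)
    }),
    paste0("ra", windows)
  )
  data %>%
    # {{ cols }} forwards the caller's tidyselect expression to across()
    mutate(across({{ cols }}, lag_fns, .names = "{.col}_{.fn}")) %>%
    # note: this matches every column whose name ends in e.g. "_lag3"
    mutate(across(ends_with(paste0("_lag", base_lag)), ra_fns,
                  .names = "{.col}_{.fn}"))
}

# Usage with the dummy data from the question
set.seed(123)
foo <- data.frame(
  date = seq(as.Date("2020-01-01"), length.out = 20, by = "day"),
  var1 = sample.int(20),
  var2 = sample.int(20)
)
out <- add_lag_ra(foo, starts_with("var"),
                  lags = c(3, 6, 9), windows = c(12, 15), base_lag = 3)
names(out)
```

Because the new column names come from the names of the function lists, changing the `paste0()` prefixes is enough to change the naming scheme; no `!!` or `:=` is needed for this particular case.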
Ronak Shah
  • Great solution, many thanks! By any chance, have you got any alternatives in mind that wouldn't require two separate assignments - in other words make all the above happen in a single step? – takmers Oct 20 '21 at 05:41
  • Yes, sure..See the updated answer. – Ronak Shah Oct 20 '21 at 05:54