I would like to apply a custom function (AddLags, defined below) to each group of a data frame. To do this, I am trying to nest two map_dfr calls, piping one into the other, so that each level applies its own filter. In the innermost step I call AddLags and rely on map_dfr to collect the newly calculated output in a new object.
The code I have so far is as follows:
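# packages used below
library(dplyr)     # filter, arrange, transmute, across, lag, bind_rows
library(purrr)     # map_dfr, map_dfc
library(lubridate) # today
library(magrittr)  # %<>%
# dummy dataset
df <- data.frame(
  date = seq(today(),length.out=12,by='month'),
  dim1 = c('a','a','a','b','b','b','c','c','c','d','d','d'),
  dim2 = c(1,1,1,1,1,1,2,2,2,2,2,2),
  value = 1:12
)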
# function to apply
AddLags <- function(df,lags_vector,target_col,date_col){
  temp_lags <- map_dfc(lags_vector,
    ~ df %>%
      arrange({{date_col}}) %>%
      transmute(
        across(contains(target_col), lag, .x, .names = '{col}_lag_{ifelse(.x<10,paste0("0",.x),.x)}')
      )
  )
  return(temp_lags)
}
# prepare for map_dfr approach
lags_features <- c(1,2)
dims1 <- df %>% pull(dim1) %>% unique %>% sort
dims2 <- df %>% pull(dim2) %>% unique %>% sort
# what I am struggling with
map_dfr(dims1,
  ~ df %>%
    filter(dim1==.x) %>%
    map_dfr(dims2,
      ~ . %>%
        filter(dim2==.x) %>%
        AddLags(lags_features,variable,date)
    )
)
# what the loop version would look like
gather_results <- data.frame()
for(d1 in dims1){
  for(d2 in dims2){
    # keep only the rows of the current (dim1, dim2) combination
    tempdata <- df %>% filter(dim1==d1,dim2==d2) %>% arrange(date)
    temp <- AddLags(tempdata,lags_features,'value',date)
    gather_results %<>% bind_rows(temp)
  }
}
In essence, I am iterating over the different groups (by filtering) and applying the custom function to each one, while trying to use map_dfr to consolidate the newly calculated results.
I would like to know how to achieve the above (assuming it is feasible) and what I am missing, since for the time being all I get back is an empty data frame.
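For illustration, here is a minimal sketch of how the nested version could be restructured. Two things in the attempt above look suspicious: `df %>% filter(...) %>% map_dfr(dims2, ...)` passes the filtered data frame as the first argument of the inner map_dfr, so dims2 is no longer the thing being iterated over, and a pipeline that starts with `. %>%` builds a function rather than a data frame. The sketch avoids both by using explicitly named arguments (d1, d2). The helper add_lags_sketch and the fixed start date are assumptions made for this example and stand in for AddLags and today().

```r
library(dplyr)
library(purrr)

# Same shape as the dummy dataset, with a fixed start date (assumption)
df <- data.frame(
  date  = seq(as.Date("2024-01-01"), length.out = 12, by = "month"),
  dim1  = rep(c("a", "b", "c", "d"), each = 3),
  dim2  = rep(c(1, 2), each = 6),
  value = 1:12
)

# Hypothetical stand-in for AddLags: appends one lag column per entry of `lags`
add_lags_sketch <- function(data, lags) {
  data <- arrange(data, date)
  for (k in lags) {
    data[[sprintf("value_lag_%02d", k)]] <- dplyr::lag(data$value, k)
  }
  data
}

# Named arguments make it explicit which level each filter refers to
result <- map_dfr(unique(df$dim1), function(d1) {
  map_dfr(unique(df$dim2), function(d2) {
    df %>%
      filter(dim1 == d1, dim2 == d2) %>%
      add_lags_sketch(c(1, 2))
  })
})
```

Combinations of dim1 and dim2 that do not occur in the data simply contribute zero rows, so result has the same number of rows as df.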
BONUS QUESTION:
As I am writing this, I realize that there has to be a better way of doing this than looping, for instance using group_by. However, given the nature of the problem and the fact that the function outputs new data, I am not sure what this would look like (assuming it is feasible to begin with). So, any kind of suggestion/alternative/best practice would be much appreciated.
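A rough sketch of what a group_by version might look like, assuming the lags only need to be computed within each (dim1, dim2) group and appended as new columns. The fixed start date and the lag_fns helper list are assumptions made for this example, and AddLags itself is not used here.

```r
library(dplyr)

# Same shape as the dummy dataset, with a fixed start date (assumption)
df <- data.frame(
  date  = seq(as.Date("2024-01-01"), length.out = 12, by = "month"),
  dim1  = rep(c("a", "b", "c", "d"), each = 3),
  dim2  = rep(c(1, 2), each = 6),
  value = 1:12
)

lags <- c(1, 2)

# One lag function per entry of `lags`, named lag_01, lag_02, ...
lag_fns <- lapply(
  setNames(lags, sprintf("lag_%02d", lags)),
  function(k) {
    force(k)  # capture the current value of k in the closure
    function(x) dplyr::lag(x, k)
  }
)

result <- df %>%
  group_by(dim1, dim2) %>%
  arrange(date, .by_group = TRUE) %>%
  # across() names the new columns "<column>_<function name>"
  mutate(across(value, lag_fns)) %>%
  ungroup()
```

Because mutate respects the grouping, each lag restarts at NA at the beginning of every group, and no explicit loop or filter is needed.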
DISCLAIMER:
I am a big noob when it comes to purrr functionality and not much of an experienced dplyr user either, so kindly forgive my ignorance.