-1

I would like to apply a function (AddLags, defined below) to each group of a data frame. To do this, I am trying to nest two map_dfr calls, piping one into the other, so that each level applies its own filter. In the innermost step I call the custom function, and map_dfr is meant to collect the newly calculated output into a single object.

The code I have so far is as follows:

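library(tidyverse)
library(lubridate) # for today()
library(magrittr)  # for %<>% in the loop version further down

# dummy dataset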
df <- data.frame(
  date = seq(today(),length.out=12,by='month'),
  dim1 = c('a','a','a','b','b','b','c','c','c','d','d','d'),
  dim2 = c(1,1,1,1,1,1,2,2,2,2,2,2),
  value = 1:12
  )

# function to apply
AddLags <- function(df,lags_vector,target_col,date_col){
  temp_lags <- map_dfc(lags_vector, 
                       ~ df %>% 
                         arrange({{date_col}}) %>% 
                         transmute(
                           across(contains(target_col), lag, .x, .names = '{col}_lag_{ifelse(.x<10,paste0("0",.x),.x)}')
                         )
  )
  return(temp_lags)
}


# prepare for map_dfr approach
lags_features <- c(1,2)
dims1 <- df %>% pull(dim1) %>% unique %>% sort
dims2 <- df %>% pull(dim2) %>% unique %>% sort

# what I am struggling with
map_dfr(dims1, 
        ~ df %>%
          filter(dim1==.x) %>%
          map_dfr(dims2,
                 ~ . %>% 
                   filter(dim2==.x) %>% 
                   AddLags(lags_features,variable,date)
          )
)

# what the loop version would look like
gather_results <- data.frame()
for(d1 in dims1){
  for(d2 in dims2){
    # filter on the loop variables d1 and d2 (comparing dim2 to itself is always TRUE)
    tempdata <- df %>% filter(dim1==d1,dim2==d2) %>% arrange(date)
    temp <- AddLags(tempdata,lags_features,'value',date)
    gather_results %<>% bind_rows(temp)
  }
}

In essence, I am iterating over the groups by filtering, applying the custom function to each one, and using map_dfr to combine the newly calculated results.

I would like to know how to achieve the above (assuming it is feasible) and what I am missing, since for now all I get back is an empty data frame.

BONUS QUESTION: As I am writing this, I realize that there has to be a better way of doing this than looping, for instance using group_by. Given the nature of the problem and the fact that the function outputs new data, I am not sure what this would look like (assuming it is feasible to begin with). Any suggestion, alternative, or best practice would be much appreciated.

DISCLAIMER: I am a big noob when it comes to purrr functionality and not much of an experienced dplyr user either, so kindly forgive my ignorance.

AndrewGB
takmers
  • In the code, you used `variable`, but I didn't find that column name in 'df' – akrun Jan 12 '22 at 16:58
  • Rather than just having the code, it would be great to also include the expected output and the LOGIC explaining how you get that expected output. – Onyambu Jan 12 '22 at 17:04
  • Also, please explain why `group_map` or one of its siblings doesn't meet your needs. – Limey Jan 12 '22 at 17:18
  • @Limey group_map sounds like it would do the trick. I have never used it, so I wouldn't know. To be honest, I wasn't even aware of its existence, hence the disclaimer. – takmers Jan 12 '22 at 17:36

2 Answers

1

Is this the expected output?

library(tidyverse)
library(lubridate)

group_split(df, dim1, dim2) %>%
  map_dfr(~ .x %>% AddLags(1:2, "value", date))
#> # A tibble: 12 × 2
#>    value_lag_01 value_lag_02
#>           <int>        <int>
#>  1           NA           NA
#>  2            1           NA
#>  3            2            1
#>  4           NA           NA
#>  5            4           NA
#>  6            5            4
#>  7           NA           NA
#>  8            7           NA
#>  9            8            7
#> 10           NA           NA
#> 11           10           NA
#> 12           11           10

Data:

# dummy dataset
df <- data.frame(
  date = seq(today(), length.out = 12, by = "month"),
  dim1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
  dim2 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  value = 1:12
)

# function to apply
AddLags <- function(df, lags_vector, target_col, date_col) {
  temp_lags <- map_dfc(
    lags_vector,
    ~ df %>%
      arrange({{ date_col }}) %>%
      transmute(
        across(contains(target_col), lag, .x, .names = '{col}_lag_{ifelse(.x<10,paste0("0",.x),.x)}')
      )
  )
  return(temp_lags)
}

Created on 2022-01-13 by the reprex package (v2.0.1)
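
On the bonus question about group_by: if the goal is simply lagged copies of one column within each group, a grouped mutate may be enough. The sketch below is an alternative to AddLags, not a drop-in replacement. It assumes the lag columns can be written out by hand rather than generated from lags_features, and it uses fixed dates instead of today() so the result is reproducible. Unlike the output above, it keeps date, dim1, dim2 and value next to the lags.

```r
library(dplyr)

# Same shape as the dummy dataset, but with fixed dates (an assumption for reproducibility)
df <- data.frame(
  date  = seq(as.Date("2022-01-01"), length.out = 12, by = "month"),
  dim1  = rep(c("a", "b", "c", "d"), each = 3),
  dim2  = rep(c(1, 2), each = 6),
  value = 1:12
)

with_lags <- df %>%
  group_by(dim1, dim2) %>%
  arrange(date, .by_group = TRUE) %>%  # sort by date inside each group
  mutate(
    value_lag_01 = lag(value, 1),      # lag() restarts at every group boundary
    value_lag_02 = lag(value, 2)
  ) %>%
  ungroup()
```

Because the data frame is grouped, lag() never pulls a value across a group boundary, which is the same effect the filter-then-apply approach is after.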

jpdugo17
0

As @Limey suggested, one possible way is to use the group_map function:

results_df <- df %>%
  group_by(dim1, dim2) %>%
  group_map(~ AddLags(.x, c(1, 2), "value", date)) %>%
  bind_rows()

This gives the following result:

   value_lag_01 value_lag_02
          <int>        <int>
 1           NA           NA
 2            1           NA
 3            2            1
 4           NA           NA
 5            4           NA
 6            5            4
 7           NA           NA
 8            7           NA
 9            8            7
10           NA           NA
11           10           NA
12           11           10

However, I would personally go with @jpdugo17's approach.
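
One thing group_map does not do is keep track of which group each row came from. A sibling, group_modify, puts the grouping columns back in front of whatever the function returns. The sketch below illustrates this with add_lags, a hypothetical simplified stand-in for AddLags (it is not the function from the question), and with fixed dates instead of today(); treat both as assumptions made for the example.

```r
library(dplyr)
library(purrr)

# Same shape as the dummy dataset, but with fixed dates (an assumption for reproducibility)
df <- data.frame(
  date  = seq(as.Date("2022-01-01"), length.out = 12, by = "month"),
  dim1  = rep(c("a", "b", "c", "d"), each = 3),
  dim2  = rep(c(1, 2), each = 6),
  value = 1:12
)

# Hypothetical simplified stand-in for AddLags: one lagged column per entry of lags_vector
add_lags <- function(data, lags_vector, target_col, date_col) {
  data <- arrange(data, {{ date_col }})
  map_dfc(lags_vector, function(k) {
    transmute(
      data,
      across(all_of(target_col), ~ lag(.x, k),
             .names = paste0("{.col}_lag_", sprintf("%02d", k)))
    )
  })
}

result <- df %>%
  group_by(dim1, dim2) %>%
  group_modify(~ add_lags(.x, 1:2, "value", date)) %>%  # keys are prepended to each result
  ungroup()

names(result)
```

The result has the same lag columns as above, plus dim1 and dim2, so each row can still be traced back to its group.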

takmers