0

I'm working with a dataframe that indexes values by three variables, date, campaign and country. Every other value is indexed according to these three values, as follows:

# Groups:   date, campaign [1,325]
   date          campaign country  cost clicks
   <date>           <dbl> <chr>   <dbl>  <dbl>
 1 2021-03-01 10127671839 0        0.45    7
 2 2021-03-01 10127671839 AD       0.47    10
 3 2021-03-01 10127671839 AE       0.39    11
 4 2021-03-01 10127671839 AF       0.27    2
 5 2021-03-01 10127671839 AG          0    0
 6 2021-03-01 10127671839 AI       1.28    2
 7 2021-03-01 10127671839 AL       0.66    6
 8 2021-03-01 10127671839 AM       0.33    2 
 9 2021-03-01 10127671839 AO          0    0
10 2021-03-01 10127671839 AR          0    0
# … with 335,215 more rows

What I'm trying to do is creating a moving average of those values (in the table above, "cost" and "clicks") that is still indexed on country, campaign and date.

Edit: I found a good function that works when there are only two index variables (in here: Rolling mean (moving average) by group/id with dplyr), but I am not skilled enough to tweak the code into working for three or more variables.

  • `df %>% group_by(date, campaign, country) %>% mutate(across(c(cost,clicks), list(rm = ~ zoo::rollmean(., 2, fill=NA))))` – r2evans Jun 21 '21 at 00:46

1 Answers1

3

I think zoo::rollmean works well here, and dplyr::group_by can handle as many index variables as you need:

library(dplyr)
mtcars %>%
  group_by(cyl, am, vs) %>%
  mutate(across(c(mpg,disp), list(rm = ~ zoo::rollmeanr(., 2, fill = NA))))
# # A tibble: 32 x 13
# # Groups:   cyl, am, vs [7]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_rm disp_rm
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>
#  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4   NA       NA 
#  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4   21      160 
#  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1   NA       NA 
#  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1   NA       NA 
#  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2   NA       NA 
#  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1   19.8    242.
#  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4   16.5    360 
#  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2   NA       NA 
#  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2   23.6    144.
# 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4   18.6    196.
# # ... with 22 more rows

The fill=NA argument means that the first in each series has no history to average on, so it is NA. If you prefer the first in a series to be an average of itself, you can instead use partial=TRUE (using rollapplyr instead):

mtcars %>%
  group_by(cyl, am, vs) %>%
  mutate(across(c(mpg,disp), list(rm = ~ zoo::rollapplyr(., 2, FUN = mean, partial = TRUE))))
# # A tibble: 32 x 13
# # Groups:   cyl, am, vs [7]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_rm disp_rm
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>
#  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4   21      160 
#  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4   21      160 
#  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1   22.8    108 
#  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1   21.4    258 
#  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2   18.7    360 
#  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1   19.8    242.
#  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4   16.5    360 
#  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2   24.4    147.
#  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2   23.6    144.
# 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4   18.6    196.
# # ... with 22 more rows

I've used the align="right" variants of zoo's functions, assuming that your moving average is historical and that time increases in subsequent rows. If these assumptions are not true, make sure you intentionally choose between the align-variants.

I used dplyr::across here to handle an arbitrary number of columns in one step: Since I used a named list of "tilde-functions", it took the name of each function and appended it to the name of each of the column names. You can break it out into individual mutate assignments if you prefer, for readability, maintainability, or if you need different sets of arguments for each column.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • I've tried your code but for some reason it only returns NAs or, if I use the second code (with partial=TRUE) it only returns the same number as before. I am using n = 7 by the way. Do you know what could it be wrong? – João Paulo Jun 21 '21 at 02:25
  • This is the code: db = db %>% group_by(date, campaign, country) %>% mutate(across(c(clicks,cost), list(mav = ~ rollmeanr(., 7, fill=NA)))) – João Paulo Jun 21 '21 at 02:26
  • 1
    Nevermind, I managed to fix it once I changed row names to put them in order according to date, and removed date from the group_by function – João Paulo Jun 21 '21 at 04:45
  • I'm glad you found it. I admit I was skeptical to *grouping* by date, thinking instead you would want `dat %>% arrange(date) %>% group_by(a, b) ...`. – r2evans Jun 21 '21 at 10:35