
R noob (still) here, working in tidyverse / RStudio.

I have a tidy dataset where each row has a date, a grouping characteristic, and a value (the actual dataset is more complicated, but that's the core of it).

I group the data by Group for each Date, and calculate some summary stats of the Value, yielding a by-group summary for each date. For instance:

grouped <- data %>% group_by(Date, Group) %>% summarise(mean = mean(Value))
head(grouped)
# A tibble: 6 × 3
# Groups:   Date [4]
  Date       Group  mean
  <date>     <fct> <dbl>
1 2021-02-18 A      37.4
2 2021-02-19 B      25.5
3 2021-02-19 A      26.1
4 2021-02-22 B      34.2
5 2021-02-22 A      26.4
6 2021-02-23 B      34.2

(Note: data is below for reproducibility.)

So far so good. Now I want to take the moving average of those summary stats (mean in this case, but could be others) by Group. I tried this with zoo::rollmean:

window_length <- 7  # e.g. a 7-record window
grouped <- grouped %>% 
    group_by(Group) %>% 
    mutate(rolling = zoo::rollmean(mean, window_length, fill = NA))

But here a problem arises: ideally, the moving average should cover a fixed number of days, not a fixed number of records, and there are some days missing for one or both groups.

What's the best way to ensure that the moving average correctly accounts for the missing day × group combinations, treating them as NA as needed?

(I understand from an existing answer that zoo::rollmean can't handle NA values, but zoo::rollapply can.)
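A quick toy illustration of that difference:

x <- c(1, NA, 3)
zoo::rollmean(x, 3)                      # NA: an NA anywhere in the window poisons rollmean
zoo::rollapply(x, 3, mean, na.rm = TRUE) # 2: rollapply can pass na.rm through to mean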

I have tried creating a simple calendar dataframe with the full set of dates to join the grouped data to, but that leaves the Group variable as NA as well, so the missing day × group combinations are still invisible to the rollmean / rollapply call (roughly sketched below).
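For concreteness, that attempt looked roughly like this (a sketch of the idea rather than my exact code):

calendar <- data.frame(Date = seq(min(data$Date), max(data$Date), by = "day"))
grouped_cal <- calendar %>% left_join(grouped, by = "Date")
# every date now has a row, but the filled-in rows have Group = NA, so they
# still don't register as missing days within group A or group B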

Hope that all makes sense!


data <- structure(list(Date = structure(c(18676, 18677, 18677, 18680, 
18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680, 
18680, 18680, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 
18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 
18681, 18681, 18681, 18682, 18682, 18682, 18682, 18682, 18683, 
18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683, 
18683, 18683, 18683, 18684, 18684, 18684, 18684, 18684, 18684, 
18684, 18684, 18684, 18684, 18684, 18685, 18685, 18685, 18685, 
18685, 18685, 18685, 18685, 18685, 18685, 18685, 18687, 18687, 
18687, 18687, 18687, 18687, 18687, 18687, 18687, 18688, 18688, 
18688, 18688, 18688, 18688, 18688, 18688, 18688, 18689, 18689, 
18689, 18689, 18689, 18689, 18690, 18690, 18690, 18690, 18690, 
18690, 18690, 18690, 18691, 18691, 18691, 18691, 18691, 18691, 
18691, 18691, 18691, 18691, 18692, 18692, 18692, 18692, 18692, 
18692, 18692, 18692, 18692, 18692, 18692, 18692, 18693, 18694, 
18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694, 
18694, 18694, 18695, 18695, 18695, 18695, 18695, 18695, 18695, 
18695, 18695, 18696, 18696, 18696, 18696, 18696, 18696, 18696, 
18696, 18696, 18697, 18697, 18697, 18697, 18697, 18697, 18697, 
18697, 18697, 18698, 18698, 18698, 18698, 18698, 18698, 18698, 
18698, 18698, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 
18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 
18699, 18699, 18699, 18700, 18701, 18701, 18701, 18701, 18701, 
18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701, 
18701, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 
18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 
18702, 18702, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 
18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 
18703, 18703, 18703), class = "Date"), Group = structure(c(2L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("B", "A"), class = "factor"), 
    Value = c(37.43, 26.13, 25.54, 31.65, 26.95, 15.29, 35.93, 
    28.59, 17.14, 30.42, 20.52, 33.4, 35.3, 36.87, 28.32, 21.78, 
    25.49, 34.13, 20.35, 40.21, 16, 24.58, 23.61, 38.94, 36.76, 
    29.68, 15.97, 20.79, 17.83, 14.65, 16.76, 35.74, 31.5, 25.6, 
    32.96, 14.1, 40.4, 24.53, 39.57, 21.38, 14.49, 22.11, 27.12, 
    16.46, 17.65, 37.32, 15.74, 17.07, 28.52, 14.72, 27.75, 36.69, 
    39.47, 26.13, 35.57, 24.08, 24.39, 13.1, 16.75, 24.49, 23.61, 
    15.04, 23.22, 37.3, 36.76, 15.77, 28.34, 35.06, 28.32, 29.39, 
    19.09, 35.68, 35.9, 37.13, 36.1, 40.55, 33.97, 24.03, 37.25, 
    34.39, 13.05, 21.64, 40.02, 26.17, 19.39, 25.76, 40.92, 24.21, 
    20.35, 27.7, 29.53, 14.19, 15.64, 32.74, 31.42, 14.01, 12.85, 
    17.31, 31.67, 23.63, 17.29, 36.71, 18.19, 17.78, 34.87, 36.87, 
    19.27, 24.97, 41.66, 16.83, 34.79, 14.94, 34.39, 40.66, 31.35, 
    31.74, 36.19, 18.28, 37.61, 37.19, 29.58, 17.04, 28.84, 16.6, 
    41.97, 32.36, 27.91, 21.7, 40.45, 35.38, 41.19, 35.68, 19.49, 
    20.94, 23.99, 14.28, 39.24, 12.19, 18.02, 39.14, 40.61, 33.32, 
    38.68, 39.18, 31.76, 22.64, 38.18, 36.75, 30.91, 38.82, 30.68, 
    14.2, 39.34, 18.91, 12.7, 28.2, 37.85, 34.06, 12.88, 40.03, 
    29.95, 14.61, 17.01, 35.64, 20.49, 39.51, 29.29, 18.84, 36.42, 
    37.88, 32.65, 19.7, 19.84, 38.75, 21.25, 40.68, 17.89, 26.3, 
    37.22, 18.03, 17.33, 36.26, 41.98, 19.4, 20.54, 18.6, 26.92, 
    15.23, 20.22, 15.2, 35.49, 15.14, 14.43, 30.82, 14.79, 17.74, 
    36.8, 17.09, 18.09, 19.92, 39.64, 23.87, 22.67, 24.66, 24.33, 
    16.82, 17.91, 21.66, 30.79, 32.91, 25.16, 38.98, 15.49, 21.33, 
    38.47, 34.46, 24.22, 36.93, 22.25, 15.33, 41.38, 34.49, 23.44, 
    30.53, 10.62, 23.8, 28.94, 12.49, 22, 24.51, 14.72, 15.53, 
    23.23, 38.93, 16.06, 19.36, 35.91, 22.2, 15.85, 33.36, 31.75, 
    19.69, 29.86, 16.3, 19.42, 19.17, 14.41, 13.18, 20.67, 17.02
    )), row.names = c(NA, -250L), class = c("tbl_df", "tbl", 
"data.frame"))
  • One possibility is to complete the missing dates before applying the rolling function, and luckily the tidyverse has a nice function for it, as [this post explains](https://stackoverflow.com/questions/48633460/fill-missing-dates-by-group). – DPH Mar 09 '23 at 15:37
  • @DPH I think that might work; simply using `complete` doesn't fill in dates where both `Group` values are missing, but in combination with a calendar dataframe join I think that can be worked around – TY Lim Mar 09 '23 at 16:15
  • Have moved my comment to an answer and expanded it. – G. Grothendieck Mar 29 '23 at 14:46

3 Answers


1) Assuming a mean over 3 days (the current point and the prior 2 days) rather than over 3 rows, and that dates are already sorted within Group (which is the case in the question), we calculate the number of rows each window should use (a vector, since each point can cover a different number of rows) and pass that to rollapplyr. At each row it averages all rows at or before the current row whose dates fall within w days of the current row's date. This performs the averaging on the original data frame without adding extra NA rows. You can find more examples of this in the Example section of ?rollapply.

library(dplyr)
library(zoo)

w <- 3  # window length in days
data %>%
  group_by(Group) %>%
  # Npoints: how many rows fall within the w-day window ending at each row
  mutate(Npoints = 1:n() - findInterval(Date - w, Date),
         Mean3 = rollapplyr(Value, Npoints, mean, partial = TRUE, fill = NA)) %>%
  ungroup

giving:

# A tibble: 250 × 5
   Date       Group Value Npoints Mean3
   <date>     <fct> <dbl>   <int> <dbl>
 1 2021-02-18 A      37.4       1  37.4
 2 2021-02-19 A      26.1       2  31.8
 3 2021-02-19 B      25.5       1  25.5
 4 2021-02-22 A      31.6       1  31.6
 5 2021-02-22 A      27.0       2  29.3
 6 2021-02-22 A      15.3       3  24.6
 7 2021-02-22 A      35.9       4  27.5
 8 2021-02-22 A      28.6       5  27.7
 9 2021-02-22 A      17.1       6  25.9
10 2021-02-22 B      30.4       1  30.4
# … with 240 more rows

2) If instead you want the window at each row to also include later rows that share the current row's date, use this. Here L is a list of offset vectors for rollapply such that L[[i]] is the vector of offsets to use at the ith row.

data %>%
  group_by(Group) %>%
  # L[[i]]: offsets relative to row i of all rows dated within [Date[i] - w, Date[i]]
  mutate(L = lapply(1:n(), 
      \(i) which(Date %in% seq(Date[i] - w, Date[i], "day")) - i),
    Mean3 = rollapplyr(Value, L, mean, partial = TRUE, fill = NA)) %>%
  ungroup %>%
  select(-L)

giving:

# A tibble: 250 × 4
   Date       Group Value Mean3
   <date>     <fct> <dbl> <dbl>
 1 2021-02-18 A      37.4  37.4
 2 2021-02-19 A      26.1  31.8
 3 2021-02-19 B      25.5  25.5
 4 2021-02-22 A      31.6  26.4
 5 2021-02-22 A      27.0  26.4
 6 2021-02-22 A      15.3  26.4
 7 2021-02-22 A      35.9  26.4
 8 2021-02-22 A      28.6  26.4
 9 2021-02-22 A      17.1  26.4
10 2021-02-22 B      30.4  32.0
# ℹ 240 more rows
# ℹ Use `print(n = ...)` to see more rows

3) Another approach is to use sqldf. This one gives a similar answer to (2). Note that group is a reserved word in SQL, so we escape it as [Group]. The query performs a self join on Group and a date condition.

library(sqldf)
sqldf("select a.Date, a.[Group], a.Value, avg(b.Value) Mean3
  from data a
  left join data b on a.[Group] = b.[Group] and b.Date between a.Date - 3 and a.Date
  group by a.rowid
  order by a.rowid")
G. Grothendieck
  • Not to necro this thread, but I'm revisiting this problem again - @g-grothendieck it looks like you're getting different `Mean3` values even within the same combination of `Date` and `Group`, which shouldn't be the case, right? If the idea is to take the mean of a given group over the last 3 days then, for instance, I would expect rows 4-9 to all have the same value (25.9, in this case). – TY Lim Apr 28 '23 at 15:22
  • It calculates the mean of the last three days up to the current row. If you want to include forward rows whose date equals the current row's date, I have added that as a second option. – G. Grothendieck Apr 28 '23 at 17:59
  • I see, thanks. Would it also work then to summarise by the last value for each `Date` / `Group` combination? (Not sure what the syntax for that would be) – TY Lim Apr 28 '23 at 19:48
  • If what you are asking is to add a column that has the last Value for all rows that have the same Group and date as the current row then `data %>% group_by(Group, Date) %>% mutate(Last = last(Value)) %>% ungroup` or if you are looking to just keep the last row within each Group/Date then `data %>% slice_tail(n = 1, by = c(Group, Date))` – G. Grothendieck Apr 29 '23 at 02:58

Edit: After revisiting the thread I'd like to propose another solution using time_roll_mean() from my timeplyr package.

It accounts for time gaps and duplicates, and accepts groups through the g argument, which is specialised in that it performs a single calculation across all groups instead of one calculation per group.

It accepts unsorted data, and where there are duplicate dates, the mean computed at the last duplicate is propagated across all rows sharing that date.

It also accepts lubridate period and duration objects.

The downside is that it can only calculate "right-aligned" rolling means.

# Uncomment below to install timeplyr
# remotes::install_github("NicChr/timeplyr")
library(timeplyr)
library(dplyr)
library(lubridate)

data %>%
  mutate(mean = time_roll_mean(Value, window = days(3), time = Date, g = Group,
                               close_left_boundary = TRUE))
#> # A tibble: 250 x 4
#>    Date       Group Value  mean
#>    <date>     <fct> <dbl> <dbl>
#>  1 2021-02-18 A      37.4  37.4
#>  2 2021-02-19 A      26.1  31.8
#>  3 2021-02-19 B      25.5  25.5
#>  4 2021-02-22 A      31.6  26.4
#>  5 2021-02-22 A      27.0  26.4
#>  6 2021-02-22 A      15.3  26.4
#>  7 2021-02-22 A      35.9  26.4
#>  8 2021-02-22 A      28.6  26.4
#>  9 2021-02-22 A      17.1  26.4
#> 10 2021-02-22 B      30.4  32.0
#> # i 240 more rows

Created on 2023-07-24 with reprex v2.0.2

NicChr
library(dplyr)
library(zoo)

# Create a calendar dataframe with the full set of dates
calendar <- data.frame(Date = seq(min(data$Date), max(data$Date), by = "day"))

# join data to the calendar by "Date" (dates absent from data get NA for Group and Value)
data_full <- full_join(data, calendar, by = "Date")

# Group the data by date and group and calculate the summary statistics of the value
grouped <- data_full %>% 
  group_by(Date, Group) %>% 
  summarise(mean = mean(Value)) 

# Group the resulting summary statistics data by group
grouped_by_group <- grouped %>% 
  group_by(Group) 

# Use rollapply() to calculate the moving average for each group separately
window_length <- 7  # the desired number of days for the moving average window
grouped_by_group <- grouped_by_group %>% 
  # base::mean is needed: a bare `mean` here would resolve to the column of that name
  mutate(rolling = rollapply(mean, window_length, base::mean, fill = NA, align = "right"))
Ilia Tetin
  • This is what I'd previously tried, but it doesn't work because dates for which one or the other group is missing will have `NA` for `Group`, so they end up not included in the final `grouped_by_group` dataframe on which the rolling mean is applied. (I could make the calendar dataframe with the full set of values for `Group` for every `Date`, but my actual data has many combinations of grouping variables, so doing this would be quite tedious and I'm hoping for a better solution.) – TY Lim Mar 09 '23 at 16:01
  • This can work if you add the argument `na.rm = TRUE` to the `rollapply` call. Also note that one can use `rollapplyr`, with an r on the end, to avoid writing out `align = "right"`. – G. Grothendieck Mar 29 '23 at 15:03
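For reference, here is a minimal sketch combining the two suggestions in the comments above (my reading of them, not code from either answer): tidyr::complete() materialises every Date × Group combination so the gaps become real NA rows, and rollapplyr(..., na.rm = TRUE) then skips those filled-in days. base::mean is spelled out because the summary column is itself named mean and would otherwise shadow the function inside mutate().

library(dplyr)
library(tidyr)
library(zoo)

window_length <- 7  # desired window in days

data %>%
  group_by(Date, Group) %>%
  summarise(mean = mean(Value), .groups = "drop") %>%
  # create a row for every Date x Group combination; new rows get mean = NA
  complete(Date = seq(min(Date), max(Date), by = "day"), Group) %>%
  group_by(Group) %>%
  arrange(Date, .by_group = TRUE) %>%
  # right-aligned window; na.rm = TRUE drops the filled-in missing days
  mutate(rolling = rollapplyr(mean, window_length, base::mean,
                              na.rm = TRUE, fill = NA)) %>%
  ungroup()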