Using dplyr mutate function to create new variable conditionally based on current row

Question

I am working on creating conditional averages for a large data set that involves # of flu cases seen during the week for several years. The data is organized as such:

Data Table

What I want to do is create a new column that tabulates that average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new row to give the average count for any year with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:

   mutate(average = case_when(
    Flu.Year==2016 ~ mean(chcc$count[chcc$Flu.Year==2016]),
    Flu.Year==2017 ~ mean(chcc$count[chcc$Flu.Year==2017]),
    Flu.Year==2018 ~ mean(chcc$count[chcc$Flu.Year==2018]),
    Flu.Year==2019 ~ mean(chcc$count[chcc$Flu.Year==2019]),
  ),

However, since there are four years of data * 52 weeks which is a lot of iterations to spell out the conditions for. Is there a way to elegantly code this in dplyr? The problem I keep running into is that I want to call values in counts column based on Week.Number and Flu.Year values in other rows conditioned on the current value of Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information / detail I can provide.

Thanks, Steven

dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) )

Please do not post an image of code/data/errors: it cannot be copied or searched (SEO), it breaks screen-readers, and it may not fit well on some mobile devices. Ref: https://meta.stackoverflow.com/a/285557 (and https://xkcd.com/2116/). Please just include the code, console output, or data (e.g., `dput(head(x))` or `data.frame(...)`) directly. — r2evans, Jun 02 '20 at 18:56
Here are some good references for providing a self-contained, reproducible question: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. — r2evans, Jun 02 '20 at 20:14

r2evans · Accepted Answer · 2020-06-02T20:55:54.487

It's bad-form and, in some cases, an error when you use $-indexing within dplyr verbs. I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.

library(dplyr)
set.seed(42)
dat <- tibble(
  Flu.Year = sample(2016:2020, size=100, replace=TRUE),
  count = sample(1000, size=100, replace=TRUE)
)

dat %>%
  group_by(Flu.Year) %>%
  mutate(average = mean(count)) %>%
  # just to show a quick summary
  slice(1:3) %>%
  ungroup()
# # A tibble: 15 x 3
#    Flu.Year count average
#       <int> <int>   <dbl>
#  1     2016   734    578.
#  2     2016   356    578.
#  3     2016   411    578.
#  4     2017   217    436.
#  5     2017   453    436.
#  6     2017   920    436.
#  7     2018   963    558 
#  8     2018   609    558 
#  9     2018   536    558 
# 10     2019   943    543.
# 11     2019   740    543.
# 12     2019   536    543.
# 13     2020   627    494.
# 14     2020   218    494.
# 15     2020   389    494.

An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.

dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count))
# # A tibble: 5 x 2
#   Flu.Year average
#      <int>   <dbl>
# 1     2016    578.
# 2     2017    436.
# 3     2018    558 
# 4     2019    543.
# 5     2020    494.

dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count)) %>%
  full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
#    Flu.Year average count
#       <int>   <dbl> <int>
#  1     2016    578.   734
#  2     2016    578.   356
#  3     2016    578.   411
#  4     2016    578.   720
#  5     2016    578.   851
#  6     2016    578.   822
#  7     2016    578.   465
#  8     2016    578.   679
#  9     2016    578.    30
# 10     2016    578.   180
# # ... with 90 more rows

The result, after chat:

tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 )  %>%
  arrange(Flu.Year, Week.Number) %>%
  group_by(Week.Number) %>%
  mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups:   Week.Number [3]
#   Flu.Year Week.Number count year_week.average
#      <int>       <int> <int>             <dbl>
# 1     2016           1     1              NA  
# 2     2016           2     2              NA  
# 3     2016           3     3              NA  
# 4     2017           1     4               1  
# 5     2017           2     5               2  
# 6     2017           3     6               3  
# 7     2018           1     7               2.5
# 8     2018           2     8               3.5
# 9     2018           3     9               4.5

Thanks for this suggestion--however, I'm still unsure how to go about calculating the averages for each week based on that same week in prior years (e.g., average volume in week 1 for 2016,2017,etc.) — Steven Morrison, Jun 02 '20 at 19:48
Have you ever used `group_by` before? It might be as simple as changing this code to use `group_by(Flu.Year, Week)` to get an average for each week of the year. If not, it would be useful to have usable sample data in your question. — r2evans, Jun 02 '20 at 19:55
I contemplated using group_by function, as a summary table would would get me to an average of each week over all the years, but that's still note quite what I'm looking for. I need an average for each week over prior years. Each column thus would have a different value. Still requires some conditional where Flu.Year < current Flu.Year — Steven Morrison, Jun 02 '20 at 20:01
Okay. More the point ... please provide **sample data**, and please also add **intended output** given that sample data. — r2evans, Jun 02 '20 at 20:13
Here's how it would be set up: dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) ) — Steven Morrison, Jun 02 '20 at 20:18
I don't know how to model the output. All I know to tell you is that, for instance, for the row where Flu.Year = 2017 and Week.Number = 1, the average in the new column should be the average of counts in prior years (2016) during Week 1 only. — Steven Morrison, Jun 02 '20 at 20:20
I [edit]ed your question for you to place that code in your question. It will never produce exactly the same numbers you have on your console, though, since the data is random. If you want something ***reproducible***, use `set.seed` first. — r2evans, Jun 02 '20 at 20:20
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/215198/discussion-between-steven-morrison-and-r2evans). — Steven Morrison, Jun 02 '20 at 20:24

score 3 · Answer 2 · answered Jun 02 '20 at 19:04

3

We can use aggregate from base R

aggregate(count ~ Flu.Year, data, FUN = mean)

answered Jun 02 '20 at 19:04

akrun

874,273
37
540
662

Using dplyr mutate function to create new variable conditionally based on current row

2 Answers2