How to produce a year over year calculated column in R

Question

First, is the data and then the manipulations. Finally, is the current method that I am using and as of yet is producing no data. The manipulations are to create a date and then create a rolling 12-Month average.

   Monthavg<- 
   c(20185,20186,20187,20188,20189,201810,201811,201812,20191,20192,20193,20194,20195,20196,
      20197,20198,20199,201910,201911,201912,20201
      ,20202,20203,20204,20205,20206,20207
      ,20208,20209,202010,202011)

  empavg<-c(2,4,6,7,8,10,12,14,16,18,20,22,24,26,28,30,32,36,36,38,40,42,44,46,48,48,50,52,52,54,56)

  ces12f <- data.frame(Monthavg,empavg)

Manipulations

 ces12f<- ces12f %>% mutate(year = substr(as.character(Monthavg),1,4),
              month = substr(as.character(Monthavg),5,7),
              date = as.Date(paste(year,month,"1",sep ="-")))
 Month_ord <- order(Monthavg)
 span_month=12
 ces12f<-ces12f %>% mutate(ravg = zoo::rollmeanr(empavg, 12, fill = NA))

Annual difference attempt

 ces12f<- ces12f%>%
 group_by(Monthavg)%>%
 mutate(PreviousYear=lag(ravg,12), 
     PreviousMonth=lag(ravg),
     AnnualDifference=ravg-PreviousYear)%>%
 ungroup()

The end goal would be that 202011 minus 201911 or 47.5 minus 25.17 or 22.3. The method that I use above produces nothing but NA's. Any insights as to how I can modify my existing code or simply use an entirely different method would be greatly appreciated.

r2evans · Accepted Answer · 2020-12-28T20:34:04.860

I tend to be a little more paranoid. That is, if there is even a slight chance that we are missing one month of however many years we have, than doing a lag(..., 12) is a bad idea, even worse because you will get no warnings or errors, and your data will be wrong.

As such, I'm going to recommend a self-join.

transmute(ces12f, year = as.character(as.integer(year) + 1L), month, lastravg = ravg) %>%
  left_join(ces12f, ., by = c("year", "month"))
#    Monthavg empavg year month       date     ravg lastravg
# 1     20185      2 2018     5 2018-05-01       NA       NA
# 2     20186      4 2018     6 2018-06-01       NA       NA
# 3     20187      6 2018     7 2018-07-01       NA       NA
# 4     20188      7 2018     8 2018-08-01       NA       NA
# 5     20189      8 2018     9 2018-09-01       NA       NA
# 6    201810     10 2018    10 2018-10-01       NA       NA
# 7    201811     12 2018    11 2018-11-01       NA       NA
# 8    201812     14 2018    12 2018-12-01       NA       NA
# 9     20191     16 2019     1 2019-01-01       NA       NA
# 10    20192     18 2019     2 2019-02-01       NA       NA
# 11    20193     20 2019     3 2019-03-01       NA       NA
# 12    20194     22 2019     4 2019-04-01 11.58333       NA
# 13    20195     24 2019     5 2019-05-01 13.41667       NA
# 14    20196     26 2019     6 2019-06-01 15.25000       NA
# 15    20197     28 2019     7 2019-07-01 17.08333       NA
# 16    20198     30 2019     8 2019-08-01 19.00000       NA
# 17    20199     32 2019     9 2019-09-01 21.00000       NA
# 18   201910     36 2019    10 2019-10-01 23.16667       NA
# 19   201911     36 2019    11 2019-11-01 25.16667       NA
# 20   201912     38 2019    12 2019-12-01 27.16667       NA
# 21    20201     40 2020     1 2020-01-01 29.16667       NA
# 22    20202     42 2020     2 2020-02-01 31.16667       NA
# 23    20203     44 2020     3 2020-03-01 33.16667       NA
# 24    20204     46 2020     4 2020-04-01 35.16667 11.58333
# 25    20205     48 2020     5 2020-05-01 37.16667 13.41667
# 26    20206     48 2020     6 2020-06-01 39.00000 15.25000
# 27    20207     50 2020     7 2020-07-01 40.83333 17.08333
# 28    20208     52 2020     8 2020-08-01 42.66667 19.00000
# 29    20209     52 2020     9 2020-09-01 44.33333 21.00000
# 30   202010     54 2020    10 2020-10-01 45.83333 23.16667
# 31   202011     56 2020    11 2020-11-01 47.50000 25.16667

You can verify that each lastempavg is the previous year's value, and you can mutate the difference normally, perhaps

transmute(ces12f, year = as.character(as.integer(year) + 1L), month, lastravg = ravg) %>%
  left_join(ces12f, ., by = c("year", "month")) %>%
  mutate(AnnualDifference = ravg - lastravg)
#    Monthavg empavg year month       date     ravg lastravg AnnualDifference
# 1     20185      2 2018     5 2018-05-01       NA       NA               NA
# 2     20186      4 2018     6 2018-06-01       NA       NA               NA
# 3     20187      6 2018     7 2018-07-01       NA       NA               NA
# 4     20188      7 2018     8 2018-08-01       NA       NA               NA
# 5     20189      8 2018     9 2018-09-01       NA       NA               NA
# 6    201810     10 2018    10 2018-10-01       NA       NA               NA
# 7    201811     12 2018    11 2018-11-01       NA       NA               NA
# 8    201812     14 2018    12 2018-12-01       NA       NA               NA
# 9     20191     16 2019     1 2019-01-01       NA       NA               NA
# 10    20192     18 2019     2 2019-02-01       NA       NA               NA
# 11    20193     20 2019     3 2019-03-01       NA       NA               NA
# 12    20194     22 2019     4 2019-04-01 11.58333       NA               NA
# 13    20195     24 2019     5 2019-05-01 13.41667       NA               NA
# 14    20196     26 2019     6 2019-06-01 15.25000       NA               NA
# 15    20197     28 2019     7 2019-07-01 17.08333       NA               NA
# 16    20198     30 2019     8 2019-08-01 19.00000       NA               NA
# 17    20199     32 2019     9 2019-09-01 21.00000       NA               NA
# 18   201910     36 2019    10 2019-10-01 23.16667       NA               NA
# 19   201911     36 2019    11 2019-11-01 25.16667       NA               NA
# 20   201912     38 2019    12 2019-12-01 27.16667       NA               NA
# 21    20201     40 2020     1 2020-01-01 29.16667       NA               NA
# 22    20202     42 2020     2 2020-02-01 31.16667       NA               NA
# 23    20203     44 2020     3 2020-03-01 33.16667       NA               NA
# 24    20204     46 2020     4 2020-04-01 35.16667 11.58333         23.58333
# 25    20205     48 2020     5 2020-05-01 37.16667 13.41667         23.75000
# 26    20206     48 2020     6 2020-06-01 39.00000 15.25000         23.75000
# 27    20207     50 2020     7 2020-07-01 40.83333 17.08333         23.75000
# 28    20208     52 2020     8 2020-08-01 42.66667 19.00000         23.66667
# 29    20209     52 2020     9 2020-09-01 44.33333 21.00000         23.33333
# 30   202010     54 2020    10 2020-10-01 45.83333 23.16667         22.66667
# 31   202011     56 2020    11 2020-11-01 47.50000 25.16667         22.33333

Side note on this: it might be better to keep the year and month stored as integer, for a few reasons: (1) it makes this kind of thing quite easy; (2) it preserves ordinality, whereas arrange(ces12f, month) will happily order the months as 1, 10, 11, 12, 2, etc; (3) (subjective) they really are integers, after all.

@r2evans, what is the middle element in the first section of the code. "(ces12f, ., by = c". Specifically, the , ., — Tim Wilcox, Dec 28 '20 at 20:40
Two things: (1) With `magrittr::%>%`, the data resulting from the previous (LHS) command in the pipe is injected into the current function whereever the `.` is located in the call; if none is found, it is made the first argument. So technically, your `ces12f %>% group_by(Monthavg)` is really `ces12f %>% group_by(., Monthavg)`. (2) My brain is much more comfortable wrapping around left-joins (I'm not an [ambi-turner](https://www.youtube.com/watch?v=8hJ1HDcMowk) :-), so I forced it to be a left-join. It could also have been `%>% right_join(ces12f, by=...)` with the same results. — r2evans, Dec 28 '20 at 20:43
If this is your first real exposure into merges/joins, then I suggest reading through various answers from https://stackoverflow.com/q/1299871/3358272 and https://stackoverflow.com/a/6188334/3358272. I find the "join" operation *crucial* to many data-wrangling methods. — r2evans, Dec 28 '20 at 20:44
I use joins regularly when I am working with SQL Server Management Studio but had not thought to use it in this instance. Still getting my feet under me in regards to R — Tim Wilcox, Dec 28 '20 at 20:49
My apologies if that came across as condescending, it was certainly not intended that way. I believe many people new to R also have little or no SQL experience, so the concept of "join" is alien. — r2evans, Dec 28 '20 at 20:52

Ian Campbell · Answer 2 · 2020-12-28T20:37:38.383

Here's an approach with tidyr::extract. You can use tidyr::complete to ensure any missing months are filled in:

library(tidyverse)
library(zoo)
ces12f %>%
  mutate(Monthavg = as.character(Monthavg)) %>%
  extract(Monthavg, into = c("Year", "Month"),
          regex = "^([0-9]{4})([0-9]{1,2})$") %>%
  mutate(across(Year:Month, as.integer)) %>%
  arrange(Year,Month) %>%
  complete(Year, Month) %>%
  mutate(ravg = zoo::rollmeanr(empavg,12,NA)) %>%
  mutate(PreviousYear=lag(ravg,12), 
         PreviousMonth=lag(ravg),
         AnnualDifference=ravg-PreviousYear)
   Year Month empavg     ravg PreviousYear PreviousMonth AnnualDifference
1  2018     1     NA       NA           NA            NA               NA
2  2018     2     NA       NA           NA            NA               NA
3  2018     3     NA       NA           NA            NA               NA
4  2018     4     NA       NA           NA            NA               NA
5  2018     5      2       NA           NA            NA               NA
6  2018     6      4       NA           NA            NA               NA
7  2018     7      6       NA           NA            NA               NA
8  2018     8      7       NA           NA            NA               NA
9  2018     9      8       NA           NA            NA               NA
10 2018    10     10       NA           NA            NA               NA
11 2018    11     12       NA           NA            NA               NA
12 2018    12     14       NA           NA            NA               NA
13 2019     1     16       NA           NA            NA               NA
14 2019     2     18       NA           NA            NA               NA
15 2019     3     20       NA           NA            NA               NA
16 2019     4     22 11.58333           NA            NA               NA
17 2019     5     24 13.41667           NA      11.58333               NA
18 2019     6     26 15.25000           NA      13.41667               NA
19 2019     7     28 17.08333           NA      15.25000               NA
20 2019     8     30 19.00000           NA      17.08333               NA
21 2019     9     32 21.00000           NA      19.00000               NA
22 2019    10     36 23.16667           NA      21.00000               NA
23 2019    11     36 25.16667           NA      23.16667               NA
24 2019    12     38 27.16667           NA      25.16667               NA
25 2020     1     40 29.16667           NA      27.16667               NA
26 2020     2     42 31.16667           NA      29.16667               NA
27 2020     3     44 33.16667           NA      31.16667               NA
28 2020     4     46 35.16667     11.58333      33.16667         23.58333
29 2020     5     48 37.16667     13.41667      35.16667         23.75000
30 2020     6     48 39.00000     15.25000      37.16667         23.75000
31 2020     7     50 40.83333     17.08333      39.00000         23.75000
32 2020     8     52 42.66667     19.00000      40.83333         23.66667
33 2020     9     52 44.33333     21.00000      42.66667         23.33333
34 2020    10     54 45.83333     23.16667      44.33333         22.66667
35 2020    11     56 47.50000     25.16667      45.83333         22.33333
36 2020    12     NA       NA     27.16667      47.50000               NA

@r2evans Ha! I literally thought the same thing as I was typing it. It's so much easier than `library(dplyr); library(tidyr)` though.... — Ian Campbell, Dec 28 '20 at 20:38
I've started comment-rant-wars (there is no better word) when I suggest against `library(tidyverse)`, good rationale on both sides rests in pedagogy. \*shrug\* — r2evans, Dec 28 '20 at 20:40

How to produce a year over year calculated column in R

2 Answers2