Dplyr Lags on Summarised Grouped Data

Question

Using dplyr, I'm looking to summarise a new column of data as a lagged version of an existing column of grouped data.

Reprex:

    dateidx <- as.Date(c("2019-01-02", "2019-01-032", "2019-01-02", "2019-01-07", "2019-01-07", "2019-01-07", "2019-01-10", "2019-01-10"))
    A <- c(100, 100, 200, 200, 200, 300, 400, 135)
    B <- c(1500, 2000, 1350, 780, 45, 200, 150, 250)
    test.df1 <- data.frame(dateidx, A, B)

> test.df1
     dateidx   A    B
1 2019-01-02 100 1500
2 2019-01-03 100 2000
3 2019-01-02 200 1350
4 2019-01-07 200  780
5 2019-01-07 200   45
6 2019-01-07 300  200
7 2019-01-10 400  150
8 2019-01-10 135  250

Objective:

>dateidx <- c("2019-01-02","2019-01-03", "2019-01-07", "2019-01-10")
> sumA <- c(300, 100, 700, 535)
> sumAlag <- c(NA, 300, 100, 700)
> meanB <- c(1425, 2000, 342, 200)
> meanBlag <- c(NA, 1425, 2000, 342)
> test.obj <- data.frame (dateidx2, sumA, sumAlag, meanB, meanBlag)

> test.obj

    dateidx sumA sumAlag meanB meanBlag
1 2019-01-02  300      NA  1425       NA
2 2019-01-03  100     300  2000     1425
3 2019-01-07  700     100   342     2000
4 2019-01-10  535     700   200      342

My code:

test.df2 <- test.df1 %>%
  group_by(dateidx) %>%
  summarise (
    sumA = sum(A), 
    sumAlag = lag(sumA),
    meanB = mean(B), 
    meanBlag =dplyr::lag(meanB)
  )

Desired Results:

    dateidx sumA sumAlag meanB meanBlag
1 2019-01-02  300      NA  1425       NA
2 2019-01-03  100     300  2000     1425
3 2019-01-07  700     100   342     2000
4 2019-01-10  535     700   200      342

Actual Results:

> test.df2
# A tibble: 4 × 5
  dateidx     sumA sumAlag meanB meanBlag
  <date>     <dbl>   <dbl> <dbl>    <dbl>
1 2019-01-02   300      NA 1425        NA
2 2019-01-03   100      NA 2000        NA
3 2019-01-07   700      NA  342.       NA
4 2019-01-10   535      NA  200        NA

Attempts to Fix:

disambiguation (like the dplyr / plyr:: mutate issue)

making a dummy variable

re-specifying the grouping using "order-by"

a dplyr error which was fixed years ago

Sources:

https://dplyr.tidyverse.org/reference/lead-lag.html: but no grouping discussed

https://dplyr.tidyverse.org/reference/lead-lag.html: leads or lags at first order of granularity only

dplyr: lead() and lag() wrong when used with group_by(): about making sure the lag is disambiguated: other issues not relevant

Lag function on grouped data: for python

dplyr lag across groups: lagging explicitly not within groups

score 1 · Accepted Answer · answered May 06 '22 at 22:18

1

You want to first summarise to get the sum and the mean, then you can use a mutate statement to get the lag of each column, then rearrange the columns.

library(tidyverse)

test.df2 <- test.df1 %>%
  group_by(dateidx) %>%
  summarise(sumA = sum(A),
            meanB = mean(B)) %>%
  mutate(sumAlag = lag(sumA),
         meanBlag = lag(meanB)) %>% 
  select(dateidx, starts_with("sum"), starts_with("mean"))

Output

  dateidx     sumA sumAlag meanB meanBlag
  <date>     <dbl>   <dbl> <dbl>    <dbl>
1 2019-01-02   300      NA 1425       NA 
2 2019-01-03   100     300 2000     1425 
3 2019-01-07   700     100  342.    2000 
4 2019-01-10   535     700  200      342.

answered May 06 '22 at 22:18

AndrewGB

16,126
5
18
49

thanks for the wonderfully simple answer. I would appreciate a comment on why having a "mutate" after the "summarise" works. I thought "mutate" ungroups so it didn't occur to me to try it. – W Barker May 06 '22 at 22:25
@WBarker The `summarise` statement will drop groups by default, but you can always specify in it to keep them `.groups = "keep"`. But it seemed like you just needed the lag of the summarized data, which in that case you don't need to group by the date (unless I have misunderstood). In `summarise`, you were trying to get the `lag` for each group, but there's only one value for each group, so it returns `NA`. So, we can easily get the `lag` (ungrouped) in the mutate statement. – AndrewGB May 06 '22 at 22:32
I got my real routine working (thanks to you), but your additional comment raises another question: (maybe just terminology): when you refer to "summarized" data, isn't that the same as the "grouped" data?. I ask because I do want (and now have), a lag by the grouped data -- in other words, rows grouped by date. The real data adds all the ending open positions of individual stocks for each day; what I needed was yesterday's sum of the same thing. I checked the results in excel and your solution works. Thanks again. – W Barker May 06 '22 at 22:59
Yes, when I refer to summarised data, I'm referring to the group summary that happens in the `summarise` step above. Then, the `mutate` statement just gets the previous row above (which is not grouped). But glad that it checks out in excel! – AndrewGB May 06 '22 at 23:08

Dplyr Lags on Summarised Grouped Data

1 Answers1

Linked