1

I am trying to get summary statistics (sum and max here) with most N recent values.

Starting data:

dt = data.table(id = c('a','a','a','a','b','b','b','b'),
                week = c(1,2,3,4,1,2,3,4),
                value = c(2, 3, 1, 0, 5, 7,3,2))

Desired result:

dt = data.table(id = c('a','a','a','a','b','b','b','b'),
                    week = c(1,2,3,4,1,2,3,4),
                    value = c(2, 3, 1, 0, 5, 7,3,2),
                    sum_recent2week = c(NA, NA, 5, 4, NA, NA, 12, 10),
                    max_recent2week = c(NA, NA, 3, 3, NA, NA, 7, 7))

With the data, I would like to have sum and max of 2 (N=2) most recent values for each row by id. 4th(sum_recent2week) and 5th (max_recent2week) columns are my desired columns

John legend2
  • 840
  • 9
  • 18
  • 1
    Related [R dplyr rolling sum](https://stackoverflow.com/questions/30153835/r-dplyr-rolling-sum) – markus Jan 17 '19 at 20:36

2 Answers2

4

You can use rollsum and rollmax from the zoo package.

dt[, `:=`(sum_recent2week = 
            shift(rollsum(value, 2, align = 'left', fill = NA), 2),
          max_recent2week = 
            shift(rollmax(value, 2, align = 'left', fill = NA), 2))
   , id]

For the sum, if you're using data table version >= 1.12, you can use data.table::frollmean. The default for frollmean is fill = NA, so no need to specify that in this case.

dt[, `:=`(sum_recent2week = 
            shift(frollmean(value, 2, align = 'left')*2, 2),
          max_recent2week = 
            shift(rollmax(value, 2, align = 'left', fill = NA), 2))
   , id]
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
1

I'm sure it can be done in a much more elegant way, but here is one tidyverse possibility:

dt %>%
 group_by(id) %>%
 mutate(sum_recent2week = lag(value + lead(value), n = 2),
        max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1))) %>%
 rowid_to_column() %>%
 select(-week, -value) %>%
 top_n(-2) %>%
 right_join(dt %>%
            rowid_to_column(), by = c("rowid" = "rowid",
                                      "id" = "id")) %>%
 select(-rowid)

  id    sum_recent2week max_recent2week  week value
  <chr>           <dbl>           <dbl> <dbl> <dbl>
1 a                 NA              NA     1.    2.
2 a                 NA              NA     2.    3.
3 a                  5.              3.    3.    1.
4 a                  4.              3.    4.    0.
5 b                 NA              NA     1.    5.
6 b                 NA              NA     2.    7.
7 b                 12.              7.    3.    3.
8 b                 10.              7.    4.    2.

First, it is computing the "sum_recent2week" and "max_recent2week" per group. Second, it selects the last two rows per group. Finally, it merges it with the original data.

Or if you want to compute it for all rows, not just for the last two rows per group:

dt %>%
 group_by(id) %>%
 mutate(sum_recent2week = lag(value + lead(value), n = 2),
        max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1)))
tmfmnk
  • 38,881
  • 4
  • 47
  • 67