1

I'm using dbplyr to access data in Redshift tables and Tidyverse to do the data wrangling. I'm trying to calculate a rolling sum amount over 7 days.

Unfortunately, none of the packages that I've found suggested on stack overflow, such as 'slider' and 'runner', seem to be compatible with dbplyr (or with the Redshift tables and sql that I'm using).

I'm trying to achieve similar results to this table. In this example, the value in the cum_sum_7Days column is the sum of values in the amount column where the start_date fits between the start_date and previous_7Day columns.

Acct          Start_Date           Previous_7Day           Amount       Cum_sum_7Days

YYYY       8/07/2022 7:04      1/07/2022 7:04               500            500
YYYY       8/07/2022 12:49     1/07/2022 12:49              200            700
YYYY       9/07/2022 11:47     2/07/2022 11:47              300            1000
YYYY       9/07/2022 11:52     2/07/2022 11:52              45.6           1045.6
YYYY       12/07/2022 13:03    5/07/2022 13:03              200            1245.6
YYYY       15/07/2022 13:53    8/07/2022 13:53              200            745.6
YYYY       16/07/2022 12:58    9/07/2022 12:58              300            700
YYYY       16/07/2022 13:28    9/07/2022 13:28              500            1200
YYYY       19/07/2022 12:22    12/07/2022 12:22             200            1400
YYYY       23/07/2022 5:52     16/07/2022 5:52              200            1200
YYYY       26/07/2022 13:01    19/07/2022 13:01             100            300
YYYY       29/07/2022 13:50    22/07/2022 13:50             200            500
YYYY       30/07/2022 13:57    23/07/2022 13:57             300            600
YYYY       3/08/2022 6:17      27/07/2022 6:17              200            700
YYYY       5/08/2022 13:30     29/07/2022 13:30             200            900
YYYY       9/08/2022 13:44     2/08/2022 13:44              200            600
YYYY       12/08/2022 12:13    5/08/2022 12:13              200            600

Note that:

  • My dates are not consecutive
  • The date-time fields are required as the 7 days must be accurate to the hour-minute
  • The rolling window size may be changed from 7 days (14 days, 1 year, etc.)

So, any solution need to handle this.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
winlai
  • 79
  • 5
  • Could you help us reproduce your problem using [reproducible example data and code](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Andy Baxter Feb 24 '23 at 09:54
  • I am struggling to understand your definitions. Is `cum_sum_7Days` the sum of all `amount` values when `start_date` is between `start_date` and `previous_7Day`? – Simon.S.A. Feb 25 '23 at 08:57
  • Hi @Simon.S.A - that is correct. The `cum_sum_7days` is the sum of the amounts between the `start_date` and `previous_7Day column`. So for example the amount on the 19/07/2022 12:22 is 200 - what the `cum_sum_7Days` would show is the 200 plus the previous 7 days amount up to 12/07/2022 12:22 and a rolling window for subsequent values. – winlai Feb 26 '23 at 00:58
  • Hi @AndyBaxter unfortunately im not sure how to share an example data as the data im using is from a database and the data ive shown above is a small snippet of it. But ive used the following code to try and produce the rolling 7 days cum sum window: `library(runner) library(dbplyr)` `tbl1 %>% group by(Acct) %>% mutate(cum_rolling_10 = sum_run( x = tbl1$Amount, k = 7, idx = Start_Date))` where Start_Date is a dttm date type. Seems like my database is not compatible with runner – winlai Feb 26 '23 at 01:18
  • This is the error i get when i use the `runner` package and `sum_run`: `Error in `purrr::pmap()`: ℹ In index: 2. ℹ With name: cum_rolling_7. Caused by error in `purrr::map_chr()`: ℹ In index: 1. Caused by error in `UseMethod()`: ! no applicable method for 'escape' applied to an object of class "c('tbl_Redshift', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')" Run `rlang::last_error()` to see where the error occurred.` – winlai Feb 26 '23 at 01:20

2 Answers2

0

Yes, this should be possible with dbplyr. But the method will look different that via pure R.

dbplyr works by translating tidyverse commands into SQL. If no translation is defined for a function then it is unlikely to work correctly. This is why using functions like sum_run from the runner package give a no applicable method error. dbplyr does not know how to translate them into SQL.

The method we can use follows similar logic to how we would approach this problem in SQL:

output = tbl1 %>%
  inner_join(tbl1, by = "Acct", suffix = c(".x", ".y")) %>%
  filter(
    Previous_7Day.x <= Start_Date.y,
    Start_Date.y <= Start_Date.x
  ) %>%
  group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %>%
  summarise(Cum_sum_7Days = sum(Amount.y) %>%
  select(
    Acct = Acct.x,
    Start_Date = Start_Date.x,
    Previous_7Day = Previous_7Day.x,
    Amount = Amount.x,
    Cum_sum_7Days
  )

The core of the idea is to join the table to itself. After filtering we have all combinations where the .y date is between the .x dates. So we can sum all the .y amounts to produce the rolling sum.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
  • Thanks @Simon.S.A. your method worked. I was hoping you could help with another issue im now facing. Im now trying to also get the rolling one year window but i cant seem to get it using the usual functions from dplyr with the database. For example i tried using: `test <- output %>% mutate(rolling_1year = Start_Date %m+% years(-1))` i would normally use this but its thrown an error when i use it within the database so assume its because dbplyr doesn't know how to translate this. Because of the data size i wouldn't be able to collect the data then create the rolling window column – winlai Feb 27 '23 at 01:04
  • I've also tried this method that i found in SO `test <- output %>% mutate(rolling_1year = sql('DATEADD(dd, -1, Start_Date)'))` but dbplyr is giving me "Invalid datetime part for DATEADD() error" even though my Start_Date is of date time format. – winlai Feb 27 '23 at 01:05
  • This probably need its own question. Take a look at this: https://stackoverflow.com/questions/66938790/handle-dates-with-dbplyr-using-pure-r and if it doesn't help, start a new question for me to take a look at. – Simon.S.A. Feb 27 '23 at 01:44
  • Thanks again @Simon.S.A. - i've posted a specific question regarding my new question about working with date time in dbplyr now: [https://stackoverflow.com/questions/75576472/add-subtract-date-from-date-time-format-in-dbplyr]. And thank you for also linking the other question on handling dates with dbplyr. – winlai Feb 27 '23 at 03:15
0

A variant on the other answer using join_by:

library(dplyr, warn.conflicts = FALSE)
library(DBI)

df <- tribble(
  ~Acct,  ~Start_Date,                ~Amount,
  "YYYY", "2022-07-08 07:04:00 AEST", 500,    
  "YYYY", "2022-07-08 12:49:00 AEST", 200,    
  "YYYY", "2022-07-09 11:47:00 AEST", 300,    
  "YYYY", "2022-07-09 11:52:00 AEST", 45.6,   
  "YYYY", "2022-07-12 13:03:00 AEST", 200,    
  "YYYY", "2022-07-15 13:53:00 AEST", 200,    
  "YYYY", "2022-07-16 12:58:00 AEST", 300,    
  "YYYY", "2022-07-16 13:28:00 AEST", 500,    
  "YYYY", "2022-07-19 12:22:00 AEST", 200,    
  "YYYY", "2022-07-23 05:52:00 AEST", 200,    
  "YYYY", "2022-07-26 13:01:00 AEST", 100,    
  "YYYY", "2022-07-29 13:50:00 AEST", 200,    
  "YYYY", "2022-07-30 13:57:00 AEST", 300,    
  "YYYY", "2022-08-03 06:17:00 AEST", 200,    
  "YYYY", "2022-08-05 13:30:00 AEST", 200,    
  "YYYY", "2022-08-09 13:44:00 AEST", 200,    
  "YYYY", "2022-08-12 12:13:00 AEST", 200
) %>%
  mutate(
    Start_Date = as.POSIXct(Start_Date)
  )

pg <- dbConnect(RPostgres::Postgres())

tbl1 <-
  df %>% 
  copy_to(pg, ., overwrite = TRUE)

tbl1 %>%
  mutate(Previous_7Day = Start_Date - days(7)) %>%
  inner_join(tbl1, 
             join_by(Acct, 
                     between(y$Start_Date, x$Previous_7Day, x$Start_Date)),
             suffix = c("", "_prev")) %>%
  group_by(Acct, Start_Date, Amount) %>%
  summarize(Cum_sum_7Days = sum(Amount_prev)) %>%
  arrange(Start_Date)          
#> `summarise()` has grouped output by "Acct" and "Start_Date". You can override
#> using the `.groups` argument.
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source:     SQL [?? x 4]
#> # Database:   postgres  [iangow@/tmp:5432/iangow]
#> # Groups:     Acct, Start_Date
#> # Ordered by: Start_Date
#>    Acct  Start_Date          Amount Cum_sum_7Days
#>    <chr> <dttm>               <dbl>         <dbl>
#>  1 YYYY  2022-07-07 21:04:00  500            500 
#>  2 YYYY  2022-07-08 02:49:00  200            700 
#>  3 YYYY  2022-07-09 01:47:00  300           1000 
#>  4 YYYY  2022-07-09 01:52:00   45.6         1046.
#>  5 YYYY  2022-07-12 03:03:00  200           1246.
#>  6 YYYY  2022-07-15 03:53:00  200            746.
#>  7 YYYY  2022-07-16 02:58:00  300            700 
#>  8 YYYY  2022-07-16 03:28:00  500           1200 
#>  9 YYYY  2022-07-19 02:22:00  200           1400 
#> 10 YYYY  2022-07-22 19:52:00  200           1200 
#> # … with more rows

Created on 2023-04-03 with reprex v2.0.2

Ian Gow
  • 3,098
  • 1
  • 25
  • 31