3

I am trying to solve a problem: run a statistic (count; sum; mean) over an irregular time series data set, where the window-size for each line is within a given date range (preferably over a grouping column). I have found that ORACLE SQL supports this through:

 COUNT(*) OVER (
    ORDER BY payment_date 
    RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
  )

And in R I have built functions that use lists to collect vectors of values for each row, but this is expensive and slow. The best solution I have found is by user: mgahan is his package boRingTrees: R: fast sliding window with given coordinates

library("devtools")
install_github("boRingTrees","mgahan")
library("boRingTrees")

set.seed(1)
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
                         32,34,66,97,151,188,211,213,241,274,294,321,
                         33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Target <- rpois(36,3)
require("data.table")
data <- data.table(Trans_Dates,Cust_ID,Target)

data[,Roll:=rollingByCalcs(data=data,bylist="Cust_ID",dates="Trans_Dates",
        target="Target",lower=0,upper=31,incbounds=T,stat=sum,na.rm=T,cores=1)]

However, when I run this against larger data sets, it also runs quite slowly.

What I have tried:

  • To use lists in loops to return window partitions, but this is very slow.
  • Importing user's functions, such as boRingTrees, which encapsulate the problem well - but are also slow.

What I have learnt:

  • There is good support in R for physical partitions (up one row, group into days/weeks, etc) through zoo and rollapply, but limited support for Ranged partitions (all lines within this number of hours from a timestamp).

What I think I need:

I have come to the conclusion that I need a C function to more speedily run a sliding window over a range of dates. I have started playing with C++ in R, and these two Rcpp efforts come close (in technique) to what I think I need:

R: Rolling window function with adjustable window and step-size for irregularly spaced observations

R: fast sliding window with given coordinates

I hope this summary is useful collation of information for people trying to solve similar problems (I found searching on this topic difficult - sparse information and very different ways to describe similar things). Hopefully someone can assist me in building a faster C++ solution I can run in R (inline or .cpp). Here is a sample data set (again, courtesy of mgahan):

Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
                         32,34,66,97,151,188,211,213,241,274,294,321,
                         33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Val <- rpois(36,3)
require("data.table")
data <- data.table(Trans_Dates,Cust_ID,Val)

e.g:

data[,RowRollCount31:=rollingByCalcs(data=data,bylist="Cust_ID",dates="Trans_Dates",  target="Val",lower=0,upper=31,incbounds=T,stat=length,na.rm=T)]

Ideally, the solution would use the 'interval' option as in the Oracle example (i.e windows within 'x' & 'hours' of each row), and also the 'group by'/'by_list' and 'stat' options that mgahan cleverly catered for.

Further reading /a good explanation of the problem:

https://blog.jooq.org/2016/10/31/a-little-known-sql-feature-use-logical-windowing-to-aggregate-sliding-ranges/

Many thanks in advance!

Community
  • 1
  • 1
snowkeep
  • 107
  • 4
  • users interested in such a feature to R package please upvote related FR: https://github.com/Rdatatable/data.table/issues/3241 – jangorecki Sep 20 '19 at 13:22

0 Answers0