I am trying to solve a problem: run a statistic (count; sum; mean) over an irregular time series data set, where the window-size for each line is within a given date range (preferably over a grouping column). I have found that ORACLE SQL supports this through:
COUNT(*) OVER (
ORDER BY payment_date
RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
)
And in R I have built functions that use lists to collect vectors of values for each row, but this is expensive and slow. The best solution I have found is by user: mgahan is his package boRingTrees: R: fast sliding window with given coordinates
library("devtools")
install_github("boRingTrees","mgahan")
library("boRingTrees")
set.seed(1)
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
32,34,66,97,151,188,211,213,241,274,294,321,
33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Target <- rpois(36,3)
require("data.table")
data <- data.table(Trans_Dates,Cust_ID,Target)
data[,Roll:=rollingByCalcs(data=data,bylist="Cust_ID",dates="Trans_Dates",
target="Target",lower=0,upper=31,incbounds=T,stat=sum,na.rm=T,cores=1)]
However, when I run this against larger data sets, it also runs quite slowly.
What I have tried:
- To use lists in loops to return window partitions, but this is very slow.
- Importing user's functions, such as boRingTrees, which encapsulate the problem well - but are also slow.
What I have learnt:
- There is good support in R for physical partitions (up one row, group into days/weeks, etc) through zoo and rollapply, but limited support for Ranged partitions (all lines within this number of hours from a timestamp).
What I think I need:
I have come to the conclusion that I need a C function to more speedily run a sliding window over a range of dates. I have started playing with C++ in R, and these two Rcpp efforts come close (in technique) to what I think I need:
R: Rolling window function with adjustable window and step-size for irregularly spaced observations
R: fast sliding window with given coordinates
I hope this summary is useful collation of information for people trying to solve similar problems (I found searching on this topic difficult - sparse information and very different ways to describe similar things). Hopefully someone can assist me in building a faster C++ solution I can run in R (inline or .cpp). Here is a sample data set (again, courtesy of mgahan):
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
32,34,66,97,151,188,211,213,241,274,294,321,
33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Val <- rpois(36,3)
require("data.table")
data <- data.table(Trans_Dates,Cust_ID,Val)
e.g:
data[,RowRollCount31:=rollingByCalcs(data=data,bylist="Cust_ID",dates="Trans_Dates", target="Val",lower=0,upper=31,incbounds=T,stat=length,na.rm=T)]
Ideally, the solution would use the 'interval' option as in the Oracle example (i.e windows within 'x' & 'hours' of each row), and also the 'group by'/'by_list' and 'stat' options that mgahan cleverly catered for.
Further reading /a good explanation of the problem:
Many thanks in advance!