I am trying to calculate a moving sum over some time-series values. However, the data is huge, and I am not sure what the fastest way to do this is.
Here is what I tried:
- using `data.table` and `filter`
- `sapply`ing, which can be parallelized using the `foreach` package (a minimal sketch of that follows the code sample below)

But I think there should be a neater way to do that.
Following is the code sample:
set.seed(12345)
library(dplyr)
library(data.table)

# Generate random per-minute data (~30 hours' worth)
ts <- seq(from = as.POSIXct(1447155253, origin = "1970-01-01"),
          to   = as.POSIXct(1447265253, origin = "1970-01-01"),
          by   = "min")
value <- sample(1:10, length(ts), replace = TRUE)
sampleDF <- data.table(timestamp = ts, value = value)

# Pre-manipulations
slidingwindow <- 5 * 60   # 5-minute window, in seconds
# Last starting index for which a full window is still available
end.ts <- sampleDF$timestamp[length(sampleDF$timestamp)] - slidingwindow
end.i  <- which(sampleDF$timestamp >= end.ts)[1]

# Apply rolling sum: for each starting timestamp, filter the rows that
# fall inside the window and sum their values
system.time(
  sapply(1:end.i, FUN = function(i) {
    from <- sampleDF$timestamp[i]   # window start
    to   <- from + slidingwindow    # window end
    filter(sampleDF, timestamp >= from, timestamp < to) %>% .$value %>% sum
  })
)
#  user  system elapsed
#  5.60    0.00    5.69
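For completeness, here is a minimal sketch of the `foreach` parallelization mentioned above. It assumes the `doParallel` backend and an arbitrary worker count; any registered backend would work the same way:

library(foreach)
library(doParallel)

cl <- makeCluster(2)   # worker count chosen arbitrarily for illustration
registerDoParallel(cl)

# Same per-window computation as in the sapply above, spread across workers;
# foreach automatically exports sampleDF, slidingwindow and end.i to each worker
rollsum <- foreach(i = 1:end.i, .combine = c,
                   .packages = c("dplyr", "data.table")) %dopar% {
  from <- sampleDF$timestamp[i]
  to   <- from + slidingwindow
  filter(sampleDF, timestamp >= from, timestamp < to) %>% .$value %>% sum
}

stopCluster(cl)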