
I am trying to calculate a moving sum over some time-series values. However, the data is huge, and I am not sure what the fastest way to do this is.

Here is what I tried:

  1. using data.table and filtering
  2. sapply, which can be parallelized with the foreach package (a sketch of that variant appears after the timing output below), but I think there should be a neater way to do this

Following is the code sample:

set.seed(12345)
library(dplyr)
library(data.table)

# Generate random data
ts = seq(from = as.POSIXct(1447155253, origin = "1970-01-01"), to = as.POSIXct(1447265253, origin = "1970-01-01"), by = "min")
value = sample(1:10, length(ts), replace = T)
sampleDF = data.frame(timestamp = ts, value = value )
sampleDF = as.data.table(sampleDF)


# Pre-manipulations
slidingwindow = 5*60 # 5-minute window, in seconds
end.ts = sampleDF$timestamp[length(sampleDF$timestamp)] - slidingwindow # last valid window start time
end.i = which(sampleDF$timestamp >= end.ts)[1] # index of that last window start


# Apply rolling sum 

system.time(
  sapply(1:end.i,
         FUN = function(i) {
           from = sampleDF$timestamp[i]  # window start
           to = from + slidingwindow     # window end
           # filter the window and sum its values
           filter(sampleDF, timestamp >= from, timestamp < to) %>% .$value %>% sum
         })
)

# user  system elapsed 
# 5.60    0.00    5.69 
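Since point 2 above only describes the parallel option, here is a minimal sketch of what I mean, using foreach with a doParallel backend (the 2-core cluster and the backend choice are assumptions, adjust to your machine):

library(foreach)
library(doParallel)

cl <- makeCluster(2) # assumption: 2 workers
registerDoParallel(cl)

# Same window logic as above, with iterations spread across the workers
parallel.sums <- foreach(i = 1:end.i, .combine = c, .packages = "dplyr") %dopar% {
  from <- sampleDF$timestamp[i]
  to <- from + slidingwindow
  filter(sampleDF, timestamp >= from, timestamp < to) %>% .$value %>% sum
}

stopCluster(cl)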
    Hmm, isn't this just `RcppRoll::roll_sum(value, 5L)` without the last element? – David Arenburg Nov 10 '15 at 15:14
  • Yep, thanks David. This is the first time I've heard of that package. I guess the fastest way could be a parallel implementation of the function you mentioned. Or what do you think? –  Nov 10 '15 at 15:33
  • Don't think parallel will improve this. Did you benchmark on your real data? Should be efficient on big sizes too. – David Arenburg Nov 10 '15 at 15:39
  • Not yet, David. Thanks for the suggestion! –  Nov 10 '15 at 16:39
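
Edit: as David suggests in the comments, the data here is strictly 1-minute spaced, so every 5-minute window [from, from + 300) covers exactly 5 consecutive rows, and a fixed-width rolling sum gives the same numbers. A minimal sketch with RcppRoll (the object name rollsum is mine):

library(RcppRoll)

# Sum of rows i..i+4 for every possible window start; roll_sum returns one
# extra trailing window compared to the sapply loop, so keep only end.i of them
rollsum <- roll_sum(sampleDF$value, n = 5L)[1:end.i]

This should be far faster than filtering per window, as David notes.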

0 Answers