My data is 988,785 obs. of 3 variables. A smaller example of my data is below:

Names <- c("Jack", "Jill", "John")
RawAccelData <- data.frame(
  Sample = as.numeric(rep(1:60000, times = 3)),
  Acceleration = rnorm(180000),
  ID = rep(Names, each = 60000)
)

The sample rate of my equipment is 100 Hz. I wish to calculate a rolling average of Acceleration for each ID over a 1 to 10 second period. I perform this using the following:

require(dplyr)
require(zoo)

for (summaryFunction in c("mean")) {
  for ( i in seq(100, 1000, by = 100)) {
    tempColumn <- RawAccelData %>%
      group_by(ID) %>%
      transmute(rollapply(Acceleration,
                          width = i, 
                          FUN = summaryFunction, 
                          align = "right", 
                          fill = NA, 
                          na.rm = TRUE))
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    RawAccelData <- bind_cols(RawAccelData, tempColumn[2])
  }
}

However, I now need to calculate a rolling average over a 1 to 10 minute period. I can do this by using the above code and substituting in the following line:

for ( i in seq(6000, 60000, by = 6000)) {

However, this takes hours to run through my dataset and causes RStudio on my Mac (details below) to hang! Is there a way I can a) tidy up the above code or b) use a different package/method to get a quicker result?

Thank you.

R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] zoo_1.7-12  dplyr_0.4.3

loaded via a namespace (and not attached):
 [1] lazyeval_0.1.10 magrittr_1.5    R6_2.1.1        assertthat_0.1  parallel_3.2.3  DBI_0.3.1      
 [7] tools_3.2.3     Rcpp_0.12.2     grid_3.2.3      lattice_0.20-33
user2716568

2 Answers

The code runs slowly for two reasons:

  1. The code in the question defeats rollapply's ability to detect that mean is being passed, because the function name is stored in a variable (summaryFunction) and that variable is passed instead. When rollapply can see that mean is being passed, it calls rollmean, which contains optimized code for that case. Had the code in the question passed mean directly, or used rollmean, it would have been substantially faster.

  2. filter does not remove NAs, so for an apples-to-apples comparison one should not use na.rm = TRUE in rollapply. Using it also defeats the optimization.
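Applied to the loop in the question, the two points above might look like the sketch below. It is only an illustration under some assumptions: the data frame accel is a small made-up stand-in for RawAccelData (2,000 samples per ID rather than 60,000), the widths are shortened to match, and the `!!col :=` naming syntax needs a dplyr version that supports tidy evaluation (0.7 or later), which is newer than the one in the question.

```r
library(dplyr)
library(zoo)

set.seed(1)

# Hypothetical stand-in for RawAccelData: 3 IDs, 2,000 samples each
accel <- data.frame(
  Sample = rep(1:2000, times = 3),
  Acceleration = rnorm(6000),
  ID = rep(c("Jack", "Jill", "John"), each = 2000)
)

# Window widths in samples (shortened for the example)
widths <- seq(100, 500, by = 100)

for (w in widths) {
  col <- paste("Rolling.mean", w, sep = ".")
  accel <- accel %>%
    group_by(ID) %>%
    # rollmeanr is the right-aligned rollmean; no na.rm argument is passed
    mutate(!!col := rollmeanr(Acceleration, k = w, fill = NA)) %>%
    ungroup()
}

str(accel)
```

Calling rollmeanr directly avoids relying on rollapply recognising mean at all, and the first w - 1 rows of each ID are NA because the window is not yet full.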

For example, in this comparison rollapply runs more than twice as fast as filter:

library(zoo)
library(rbenchmark)

set.seed(123)
r <- rnorm(10000)
benchmark(filter = stats::filter(r, rep(1/100,100), sides = 1),
          rollapply = rollapplyr(r, 100, mean, fill = NA))[1:4]

giving:

       test replications elapsed relative
1    filter          100    3.75    2.119
2 rollapply          100    1.77    1.000

The speed may, of course, vary according to the width, data length and other aspects of the input since this is only one test.
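Before comparing timings it may be worth checking that the two approaches really return the same numbers, and seeing how they differ once an NA is present. The following is a small sketch using the same simulated vector r as above; the variable names (via_filter, via_roll, r_na) and the position of the injected NA are made up for illustration.

```r
library(zoo)

set.seed(123)
r <- rnorm(10000)
k <- 100

# Same right-aligned moving average computed two ways
via_filter <- as.numeric(stats::filter(r, rep(1 / k, k), sides = 1))
via_roll   <- rollapplyr(r, k, mean, fill = NA)
same <- isTRUE(all.equal(via_filter, via_roll))
print(same)

# Inject a single NA to see how each approach treats it
r_na <- r
r_na[500] <- NA

# filter: every window that touches the NA becomes NA
na_filter <- sum(is.na(stats::filter(r_na, rep(1 / k, k), sides = 1)))

# rollapply with na.rm = TRUE: the NA is dropped inside each window,
# so only the leading fill values remain NA (at the cost of the slower path)
na_roll <- sum(is.na(rollapplyr(r_na, k, mean, na.rm = TRUE, fill = NA)))

print(c(filter = na_filter, rollapply = na_roll))
```

If the data contain no NAs, the two results should agree and na.rm = TRUE buys nothing, which is why leaving it out is the fair comparison.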

G. Grothendieck
  • I appreciate the context on why it is running slowly; this will help me when writing code in the future. Thank you! – user2716568 Mar 15 '16 at 22:16
I'm not sure whether you have other summary functions in mind, but at least for the mean, you can get a speed-up by replacing rollapply with filter: transmute(stats::filter(Acceleration, rep(1/i, i), sides = 1))

(See other options here: Calculating moving average in R.) Measured with system.time, this cut my run time from 117 seconds to 4 seconds!

You can also run the loop iterations in parallel. Instead of

for ( i in seq(6000, 60000, by = 6000)) {

try:

library(parallel)
library(dplyr)

for (summaryFunction in c("mean")) {
  # Each window width is handled by a separate forked worker
  rollCols = mclapply(seq(100, 1000, by = 100), function(i) {
    tempColumn <- RawAccelData %>%
      group_by(ID) %>%
      # One-sided moving average: i equal weights of 1/i
      transmute(stats::filter(Acceleration, rep(1/i, i), sides = 1))
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    return(tempColumn[2])
  })
}

RawAccelData = cbind(RawAccelData,do.call(cbind,rollCols))

This cut my run time from 72 seconds to 40 seconds, but the gain depends on how many cores your computer has.
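On the number of cores: mclapply only uses getOption("mc.cores", 2L) workers unless told otherwise, so it may help to set mc.cores explicitly. Below is a minimal sketch of that idea on a plain simulated vector; x, widths, n_cores and roll_cols are illustrative names, not taken from the question, and grouping by ID is left out to keep it short.

```r
library(parallel)

set.seed(42)
x <- rnorm(20000)                      # simulated signal (assumption)
widths <- seq(100, 1000, by = 100)     # window widths in samples

# Choose a worker count: forking is unavailable on Windows, and
# detectCores() may return NA on some platforms
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 2L
n_cores <- max(1L, n_cores - 1L)       # leave one core free
if (.Platform$OS.type == "windows") n_cores <- 1L

# One right-aligned moving average per width, computed in parallel
roll_cols <- mclapply(widths, function(w) {
  as.numeric(stats::filter(x, rep(1 / w, w), sides = 1))
}, mc.cores = n_cores)

names(roll_cols) <- paste("Rolling.mean", widths, sep = ".")
result <- as.data.frame(roll_cols)
str(result)
```

With only ten widths there are at most ten tasks, so adding more than ten workers cannot help, and forking overhead can outweigh the gain when each task is cheap.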

user20061