
Disclaimer

Hello everyone! I recently started programming in R. My code works just fine, but in terms of speed, some of it takes way too long to be of practical use. I hope someone can help me make this code run faster, either by optimising the code or by using one of the multicore packages.


About the code

I have large datasets containing about 15,000 numeric values each. The code takes two parameters (p, n), where p >= n, and makes subsets of the data. It applies the zyp.yuepilon function (from the zyp package) to each row of the subsets. Then the parameter n is used to apply the same function to an n-sized subset of each row.

The problem is that I run this code in a nested for loop, with p in 10:40 and n in 10:40, so it takes an eternity to get the results, and this is just one dataset among many others (a sketch of my driver loop follows the function below).

sp <- function(p, n){
  library(zyp)

  data <- runif(15000, 1, 4)
  # Sliding windows of width p over the series, starting at index 80 - p + 1
  lower <- seq(80 - p + 1, by = 1, length.out = length(data) - 81)
  upper <- lower + p - 1

  subsets <- matrix(nrow = length(lower), ncol = p)
  for(j in 1:length(lower)){
    subsets[j, ] <- data[lower[j]:upper[j]]
  }

  # Trend test on each full p-sized window, then again on the first n values
  ret <- apply(subsets, 1, zyp.yuepilon)

  subset_n <- subsets[, 1:n]
  ret2 <- apply(subset_n, 1, zyp.yuepilon)
  return(list(ret, ret2))
}
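
For context, my outer sweep looks something like this (a minimal sketch; exactly how I store the results is illustrative, and I use n in 10:p to enforce p >= n):

# Illustrative driver loop: sweep all (p, n) pairs with p >= n
results <- list()
for(p in 10:40){
  for(n in 10:p){
    results[[paste(p, n, sep = "_")]] <- sp(p, n)
  }
}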

Benchmark results in seconds:

expr      min       lq   median       uq      max neval
sp(7, 6) 92.77266 94.24901 94.53346 95.10363 95.64914    10
John_Slinger
  • The first step when code is too slow is profiling. See `help("Rprof")` (a minimal sketch follows these comments). – Roland Aug 21 '14 at 11:32
  • Yes, you need to find out where the time is spent to avoid premature optimization. – martin Aug 21 '14 at 11:32
  • However, I suspect that `zyp.yuepilon` is your bottleneck. Write a faster implementation or use parallelization. – Roland Aug 21 '14 at 11:33
  • This question appears to be off-topic because it is about optimizing working code; it's a better fit for codereview.stackexchange.com. – josliber Aug 21 '14 at 12:43
  • Thanks for the comments, I really appreciate them. I used profiling, and zyp.yuepilon is the bottleneck, as Roland wrote. I tried the snowfall package, but it didn't really get any faster. It really is off-topic; thanks for pointing it out. I will pay more attention in the future. – John_Slinger Aug 21 '14 at 16:16
  • The author of the original post clearly states that they "recently started programming in R". Perhaps the question could be phrased more accurately, by someone with more experience, as: "What is the best, or an alternative, loop method when calling a non-optimised function? I've tried using `apply`; here is my code." Not sure what exactly is off-topic here - perhaps remove the zyp library and provide a generic function. – Rusan Kax Aug 22 '14 at 01:18
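
For reference, a minimal profiling sketch along the lines Roland suggests (the output file name is illustrative):

Rprof("sp_profile.out")                  # start the sampling profiler
invisible(sp(7, 6))                      # run the code under study
Rprof(NULL)                              # stop profiling
summaryRprof("sp_profile.out")$by.self   # time spent per function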

1 Answer


Here is a series of comments, rather than an answer.

Looking at the zyp.yuepilon function body (printed by typing the function name without parentheses in an R session), you can see that this function, and the function zyp.sen, are written in plain R code (as opposed to compiled code).

The biggest speed-up is likely to be attained by using the Rcpp package, which facilitates calling compiled C++ code from within R. In fact, there is a small linear model example here: Fast LM model using Rcpp/RcppArmadillo.

I would be inclined to rewrite the two functions zyp.yuepilon and zyp.sen in C++ using Rcpp, including the loop over subset vectors (which you are currently handling with apply).
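
To give a flavour of the approach, here is a minimal, hypothetical Rcpp sketch of a Sen-style median-of-pairwise-slopes estimator; it is not the zyp internals, just an illustration of moving a hot inner function to C++:

library(Rcpp)

# Hypothetical example only: Theil-Sen slope (median of pairwise slopes)
# in C++, standing in for a plain-R inner loop.
cppFunction('
double sen_slope(NumericVector y) {
  int n = y.size();
  std::vector<double> slopes;
  slopes.reserve(n * (n - 1) / 2);
  for (int i = 0; i < n - 1; ++i)
    for (int j = i + 1; j < n; ++j)
      slopes.push_back((y[j] - y[i]) / (j - i));
  std::sort(slopes.begin(), slopes.end());
  int m = slopes.size();
  return m % 2 ? slopes[m / 2] : 0.5 * (slopes[m / 2 - 1] + slopes[m / 2]);
}')

sen_slope(runif(20, 1, 4))  # returns a single slope estimate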

For general R speed-up issues, see this question: R loop performance, as well as the R package plyr, which may provide an entry point for taking a map-reduce type of approach to your problem.
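
For instance, plyr's array apply can hand rows to a registered parallel backend (a sketch, assuming a subsets matrix as in the question and a foreach backend registered as in the code further down):

library(plyr)
# Row-wise apply, dispatched to the registered foreach backend
ret <- aaply(subsets, 1, zyp::zyp.yuepilon, .parallel = TRUE)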

If you want to steer clear of C++, then a series of micro-optimisations and parallelisation would be your quickest win. To speed up the apply aspect of your code, you could use something like this:

library(doParallel)
library(parallel)
library(foreach)
library(zyp)

cl <- makeCluster(4)
registerDoParallel(cl)

sp_1 <- function(p = 7, n = 6){
  N_ob    <- 15000
  off_set <- 81
  N_ob_o  <- N_ob - off_set

  # Stand-in data: a fresh random matrix rather than sliding windows
  am      <- matrix(runif(N_ob * p), ncol = p)
  subsets <- am[-(1:off_set), ]

  # zyp.yuepilon returns 11 values per series; collect them into a matrix
  ret <- matrix(unlist(foreach(i = 1:N_ob_o) %dopar%
                         zyp::zyp.yuepilon(subsets[i, ]),
                       use.names = FALSE), ncol = 11, byrow = TRUE)

  subset_n <- subsets[, 1:n]
  ret2 <- matrix(unlist(foreach(i = 1:N_ob_o) %dopar%
                          zyp::zyp.yuepilon(subset_n[i, ]),
                        use.names = FALSE), nrow = 11)
  return(list(ret, ret2))
}

sp <- function(p = 7, n = 6){

  data <- runif(15000, 1, 4)
  lower <- seq(80 - p + 1, by = 1, length.out = length(data) - 81)
  upper <- lower + p - 1

  subsets <- matrix(nrow = length(lower), ncol = p)
  for(j in 1:length(lower)){
    subsets[j, ] <- data[lower[j]:upper[j]]
  }

  ret <- apply(subsets, 1, zyp.yuepilon)

  subset_n <- subsets[, 1:n]
  ret2 <- apply(subset_n, 1, zyp.yuepilon)
  return(list(ret, ret2))
}

system.time(sp_1())
system.time(sp())

stopCluster(cl)  # release the workers when finished

This gives me a speed-up of around a factor of 2, but your mileage will depend on your platform. Check out the help files for the functions and packages above, and tune the number of workers passed to makeCluster to see what works best for your set-up (in the absence of any information about your particular machine).
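
A common starting point (an assumption, not a universal rule) is one worker per core, leaving one core free:

library(parallel)
# Leave one core free for the OS and the master R session
cl <- makeCluster(max(1, detectCores() - 1))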

Another route is to make use of the byte-code compiler via library(compiler), to see if the various functions can be optimised this way.

library(compiler)
enableJIT(3)                      # just-in-time compile at the highest level
zyp_comp <- cmpfun(zyp.yuepilon)  # explicitly byte-compile the hot function
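
The compiled function is then a drop-in replacement in the apply calls above, for example:

ret <- apply(subsets, 1, zyp_comp)  # inside sp, in place of zyp.yuepilon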
Rusan Kax