
I've been using the gregmisc library to perform a rolling decile ranking.

Let's say I have a vector 'X' of 1000 continuous values and I apply my function with a look-back window of 250 (which is what I use).

My current function works as follows: the first 250 records are assigned values between 1 and 10 based on the first window. The value for record 251 is then determined from the window c(2:251), record 252 from c(3:252), and so on.
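
For concreteness, a minimal (slow) loop version of that scheme might look something like the sketch below; note it uses a simple rank-based decile rather than the quantile-based one in my actual function further down, and it skips the label reversal:

# Sketch only: rank-based rolling decile, no label reversal
naive_roll_decile <- function(x, width = 250) {
    out <- rep(NA_real_, length(x))
    # the first `width` records are ranked within the first window
    out[1:width] <- ceiling(10 * rank(x[1:width]) / width)
    for (i in (width + 1):length(x)) {
        w <- x[(i - width + 1):i]
        # decile of the newest value within its trailing window
        out[i] <- ceiling(10 * rank(w)[width] / width)
    }
    out
}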

While it does the trick faster than a loop, the performance of gregmisc's "running" function applied to my decile ranking function leaves much to be desired.

I've been speeding up my other functions by operating over the entire time series at once, essentially precomputing columns of the information I need at each point in time; where I've used this method, I've reduced processing time by as much as 95%. However, I have not found a similar vectorized solution for this problem.

Matrices might be faster, but I haven't seen a matrix-based approach done well enough to beat my running version.

Any ideas?

Thanks!


Here is the code I'm using: one core function, then a function that applies it with "running" from gregmisc:

F_getDecileVal <- function(x, deciles = 0.1) {
    len <- length(x)
    y <- array(0, dim = len)
    deciles <- seq(0, 1, deciles)
    decileBounds <- quantile(x, deciles, na.rm = TRUE)
    lendecile <- length(decileBounds)
    # label values from 1 (lowest decile) to 10 (highest decile)
    for (i in 2:lendecile) {
        y[which(x <= decileBounds[[i]] & x >= decileBounds[[i - 1]])] <- (i - 1)
    }
    # Reverse the label order (decile 1 becomes 10, decile 10 becomes 1)
    dec1 <- which(y == 1); dec2 <- which(y == 2); dec3 <- which(y == 3); dec4 <- which(y == 4); dec5 <- which(y == 5)
    dec6 <- which(y == 6); dec7 <- which(y == 7); dec8 <- which(y == 8); dec9 <- which(y == 9); dec10 <- which(y == 10)
    y[dec1] <- 10; y[dec2] <- 9; y[dec3] <- 8; y[dec4] <- 7; y[dec5] <- 6
    y[dec6] <- 5; y[dec7] <- 4; y[dec8] <- 3; y[dec9] <- 2; y[dec10] <- 1

    return(y)
}
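
As an aside, the block of which() calls at the end is equivalent to a one-line vectorized reversal (assuming the default deciles = 0.1, so the labels run 1 through 10):

# Labels 1..10 become 10..1; entries left at 0 (NA positions) are untouched
y[y > 0] <- 11 - y[y > 0]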

And the rolling function:

F_getDecileVal_running <- function(x, decilecut = 0.1, interval) {
    len <- length(x)
    # Modified by ML 5/4/2013
    # NB: decilecut is accepted here but not forwarded to F_getDecileVal
    y <- array(NA, dim = len)
    if (len >= interval) {
        y <- running(x, fun = F_getDecileVal, width = interval, records = 1, pad = TRUE, simplify = TRUE)
        # the first window has no history, so rank it in one shot
        y[1:interval] <- F_getDecileVal(x[1:interval])
    }
    return(y)
}
# system.time(F_getDecileVal_running(mydata[,8],interval=250))
# > dim(mydata)
# [1] 5677    9
#user  system elapsed 
#   4.28    0.00    4.38 
ML_for_now
  • It would be helpful if you provided a small example, the function that you are currently using, along with timings, since you are looking to speed it up. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – matt_k Jun 08 '14 at 02:43
  • Apologies, you are so right. Added. – ML_for_now Jun 09 '14 at 14:14

2 Answers


If you can accept a version of 'decile' that is not the one R's quantile function uses by default (though it is one of the available choices, I think type = 6), then you can probably just sort each window and extract the 26th, 51st, 76th, ... items, up to either the 226th or the 250th, depending on whether you also want the min and max or just the inner decile "hinges". The rollapply function in the zoo package is designed for rolling function application, and I think it will prove more useful in the long run than gregmisc::running since it is part of a suite of functions for time series. This more minimal example returns just the min, median, and max for a simple set:

x <- 1:1000
require(zoo)
rollapply(x[1:300], 250, function(x) sort(x)[ c(1, 125, 250) ] )
      [,1] [,2] [,3]
 [1,]    1  125  250
 [2,]    2  126  251
 [3,]    3  127  252
 [4,]    4  128  253
 [5,]    5  129  254
 [6,]    6  130  255
 [7,]    7  131  256
 snipped the rest of the 50 lines of the output matrix.
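
Building on that, a hedged sketch of a full rolling decile rank along these lines might look like the code below. The break positions at every 25th order statistic are an assumption for a 250-wide window and won't exactly match quantile()'s default (type = 7) interpolation:

# Sketch: decile of each window's newest value, using sorted order
# statistics as approximate break points (no interpolation); assumes
# zoo is loaded as above
roll_decile_sorted <- function(x, width = 250) {
    rollapply(x, width, function(w) {
        breaks <- sort(w)[seq(width / 10, width - width / 10, by = width / 10)]
        # findInterval() returns 0..9 against the 9 break points; +1 gives 1..10
        findInterval(w[width], breaks) + 1
    }, align = "right")
}
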
IRTFM
  • Any way I can see the first part as well? Apologies, but I'm not sure I'm understanding the decile type =6 part. Thank you – ML_for_now Jun 09 '14 at 14:32
  • Decile points can be either the last value in the lowest 10%, or the next value, or some interpolated value between the lowest 10% and the next value. Read ?quantile to see the full list of quantile 'types'. My method still requires a sort (which is fairly costly computationally for large datasets) but does not need the interpolation. – IRTFM Jun 09 '14 at 20:27
  • I'll check that out. thanks again. I like the sort suggestion because with <5000 records, which will have the highest frequency density in my data, it goes faster than my current. I'll just have to brush up on ?quantile and check type = 6. Thanks again. – ML_for_now Jun 10 '14 at 01:26
  • Well the 'sort' method is much more time costly than what I have already. I'm looking for any ideas how I can speed this up or is there a more "global" approach versus rolling. – ML_for_now Jun 14 '14 at 23:11
  • You could keep a running vector of decile split points and then only need to replace within a single 1/10th segment of values. You would only need to sort when there was a hit at one of the boundaries; otherwise you would just replace a single value with each iteration (see the sketch after these comments). – IRTFM Jun 15 '14 at 07:24
  • Interesting. Recently I've modified the code to do nothing more in the "rolling" stage than return the ordinal ranking of each record; then I assign the decile ranking in one shot, as the ranking will relate to some boundary of the ranking between 1 & length(x). I've reduced my time from the low 4's above to the mid 2's. However, as you said, the sort IS the cost, and therefore the decile (output) may be the quicker application during the "roll" over the vector. Thanks, I'll try that as well. – ML_for_now Jun 17 '14 at 00:49
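
A hedged sketch of the incremental idea above, in a simpler variant that keeps the entire window sorted and updates it in place each step rather than tracking only the ten split points (the names are illustrative, not from the thread):

# Sketch: maintain the sorted window incrementally, so each step costs
# O(width) vector surgery instead of a full sort
roll_decile_incremental <- function(x, width = 250) {
    n <- length(x)
    out <- rep(NA_integer_, n)
    sorted <- sort(x[1:width])
    for (i in width:n) {
        if (i > width) {
            # drop the value leaving the window, insert the one entering it
            sorted <- sorted[-match(x[i - width], sorted)]
            pos <- findInterval(x[i], sorted)
            sorted <- append(sorted, x[i], after = pos)
            rank_new <- pos + 1
        } else {
            rank_new <- match(x[i], sorted)
        }
        # decile of the newest value from its rank in the sorted window
        out[i] <- ceiling(rank_new / (width / 10))
    }
    out
}
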
rolling_decile <- function(i, v, window) {
  v_s <- v[i:(i + window - 1)]
  cut(v_s,
      breaks = quantile(v_s, probs = seq(0, 1, by = 0.1)),
      include.lowest = TRUE,
      labels = 1:10)
}

get_deciles <- function(x, window) {
  l <- lapply(1:(length(x) - window + 1), rolling_decile, x, window)
  # the first window supplies the first `window` deciles; each later
  # window contributes only the decile of its newest (last) value
  c(l[[1]], unlist(lapply(2:length(l), function(x) l[[x]][window])))
}

x <- 1:1000
window <- 250 
d <-  get_deciles(x, window)

Given your question, I'm not sure how fast your current function is, but here are the timings for mine:

library(microbenchmark)
microbenchmark( 
  FUN = {
        get_deciles(x, window)
        })

#Unit: milliseconds
# expr      min       lq   median      uq     max neval
#  FUN 233.0379 242.6611 246.1712 249.682 309.985   100
matt_k
  • Since we don't have access to the dataset that you are using, we can't compare to your timings. Like I said in my comment to your original post, it really helps if you create a small example. I was a little unsure if what I provided even did what you wanted, so an example would help there as well. See the link above for more details. I'm not familiar with `gregmisc` and it looks like it was removed from CRAN? – matt_k Jun 09 '14 at 19:00
  • yes, you're right on all counts. I just pulled down pricing data from Yahoo for this example and used close prices since 1990-01-01 for IBM. Sorry I didn't include that rather important detail earlier. :( Yes, Gregmisc was rolled up into another library. Was it Zoo? I forget but I know it was absorbed by another library. – ML_for_now Jun 10 '14 at 01:23