
Below is the piece of code. It computes the percentile of the trade price level over a rolling (historical) 15-minute window. It runs quickly if the length is 500 or 1000, but as you can see there are 45K observations, and for the entire data set it's very slow. Can I apply any of the plyr functions? Any other suggestions are welcome.

This is what the trade data looks like:

> str(trade)
'data.frame':   45571 obs. of  5 variables:
 $ time    : chr  "2013-10-20 22:00:00.489" "2013-10-20 22:00:00.807" "2013-10-20 22:00:00.811" "2013-10-20 22:00:00.811" ...
 $ prc     : num  121 121 121 121 121 ...
 $ siz     : int  1 4 1 2 3 3 2 2 3 4 ...
 $ aggress : chr  "B" "B" "B" "B" ...
 $ time.pos: POSIXlt, format: "2013-10-20 22:00:00.489" "2013-10-20 22:00:00.807" "2013-10-20 22:00:00.811" "2013-10-20 22:00:00.811" ...

And this is what the data looks like after adding the new column trade$time.pos:

trade$time.pos <- strptime(trade$time, format="%Y-%m-%d %H:%M:%OS") 

> head(trade)
                     time      prc siz aggress                time.pos
1 2013-10-20 22:00:00.489 121.3672   1       B 2013-10-20 22:00:00.489
2 2013-10-20 22:00:00.807 121.3750   4       B 2013-10-20 22:00:00.807
3 2013-10-20 22:00:00.811 121.3750   1       B 2013-10-20 22:00:00.811
4 2013-10-20 22:00:00.811 121.3750   2       B 2013-10-20 22:00:00.811
5 2013-10-20 22:00:00.811 121.3750   3       B 2013-10-20 22:00:00.811
6 2013-10-20 22:00:00.811 121.3750   3       B 2013-10-20 22:00:00.811

# t_15_index returns the indices of the trades executed in the last 15 minutes before the current trade (t-15 to t).
t_15_index <- function(data_vector,index) {
  which(data_vector[index] - data_vector[1:index]<=15*60)
}

get_percentile <- function(data) {
  len_d <- dim(trade)[1]  

  price_percentile = vector(length=len_d)  

  for(i in 1: len_d) {   

    t_15 = t_15_index(trade$time.pos,i)
    # ecdf(rep(..)) builds the size-weighted empirical distribution of trade prices (each price repeated by its trade size)
    price_dist = ecdf(rep(trade$prc[t_15],trade$siz[t_15]))
    # percentile of the current price level within the current (t-15 to t) subset of data
    price_percentile[i] = price_dist(trade$prc[i])
  }
  trade$price_percentile = price_percentile
  trade
}


res_trade = get_percentile(trade)
Neerav
  • `get_percentile` takes an input, `data`, and doesn't reference `data` within the function – rawr Jan 11 '14 at 16:04
  • If you have some time to learn something, you can look at the `data.table` package; it implements a rolling join with the argument `roll`. I am pretty sure that would be multiple times quicker than using `plyr` – statquant Jan 11 '14 at 16:57
  • This answer (http://stackoverflow.com/questions/20134823/r-faster-way-to-calculate-rolling-statistics-over-a-variable-interval/20137464#20137464) should give you some ideas if you want to try Rcpp. You'll have to figure out `ecdf` in C++. After you munge your data correctly, the compiled function would probably work in milliseconds. – kdauria Jan 29 '14 at 16:36
  • Also, if anyone is looking for a fast function to find points within an interval, check out `findInterval`. – kdauria Feb 11 '14 at 13:44
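
Along those lines, here is a minimal sketch of how `findInterval` could return every window's start index in one vectorized call (illustrative only: it assumes sorted times, and the boundary treatment differs slightly from the `<=` used in `t_15_index`):

# Sketch: vectorized lookup of each 15-minute window's start index.
# Assumes trade$time.pos is sorted; convert to numeric seconds first.
times <- as.numeric(as.POSIXct(trade$time.pos))
window_start <- findInterval(times - 15 * 60, times) + 1L
# window_start[i] is the first index whose time lies strictly within
# (times[i] - 900, times[i]]; a trade exactly 900 seconds old is excluded
# here, whereas the `<=` in t_15_index would include it.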

4 Answers


There may be a way to accelerate the rolling application, but due to the changing window size I think the standard tools (e.g. rollapply) don't work, though perhaps someone with more familiarity with them will have ideas. In the meantime, you can optimize your percentile calculation. Instead of using ecdf, which creates a function with all the associated overhead, you can compute a decent approximation directly:

> vec <- rnorm(10000, 0, 3)
> val <- 5
> max(which(sort(vec) < val)) / length(vec)
[1] 0.9543
> ecdf(vec)(val)
[1] 0.9543
> microbenchmark(max(which(sort(vec) < val)) / length(vec))
Unit: milliseconds
expr      min       lq   median       uq      max neval
max(which(sort(vec) < val))/length(vec) 1.093434 1.105231 1.116364 1.141204 1.449141   100
> microbenchmark(ecdf(vec)(val))
Unit: milliseconds
expr      min       lq   median       uq      max neval
ecdf(vec)(val) 2.552946 2.808041 3.043579 3.439269 4.208202   100

Roughly 2.5x improvement. The improvement is greater for smaller samples.

BrodieG

Alright, this question got me interested. These are the things I did:

  1. replaced ecdf with the custom percentile computation
  2. changed time.pos to numeric (since it's in seconds anyway), as there is additional overhead from [.POSIXct vs [ (see the quick illustration after this list)
  3. changed t_15_index to only look back as far as the earliest time stamp of the previous window; since the data is sorted, we don't need to look back all the way to index 1
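
To give a rough sense of point 2, here is a minimal sketch comparing subsetting a POSIXct vector with subsetting its numeric equivalent (illustrative only; exact timings are machine-dependent):

library(microbenchmark)

# `[` on a classed POSIXct vector dispatches to `[.POSIXct`, which is
# noticeably slower than the internal `[` on a plain numeric vector.
x.posix <- Sys.time() + seq_len(45000)   # POSIXct times
x.num   <- as.numeric(x.posix)           # plain numeric seconds
idx     <- 1:1000

microbenchmark(x.posix[idx], x.num[idx])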

And this is the outcome:

> system.time(res2 <- get_percentile2(trade))
  user  system elapsed 
14.458   0.768  15.215 
> system.time(res1 <- get_percentile(trade))
   user  system elapsed 
110.851  17.974 128.736 

And demonstrating the outputs are the same:

tail(sort(abs(res1$price_percentile - res2$price_percentile)), 50)
#  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [45] 0 0 0 0 0 0

Approx 8.5x improvement. Note that the improvement is much greater if you have fewer items per 15-minute interval: this simulation crams 45K points into 24 hours, so if your 45K observations actually span a few days you can expect an even bigger improvement. Here is the code:

t_15_index2 <- function(data_vector,index, min.index) {
  which(data_vector[index] - data_vector[min.index:index]<=minutes*60) + min.index - 1L
}
get_percentile2 <- function(trade) {
  len_d <- dim(trade)[1]  
  price_percentile = vector(length=len_d)
  min.index <- 1  
  for(i in 1: len_d) {
    t_15 = t_15_index2(trade$time.pos.2,i, min.index)
    vec <- rep(trade$prc[t_15],trade$siz[t_15])
    price_percentile[i] <- max(0, which(sort(vec) <= trade$prc[i])) / length(vec)
    min.index <- t_15[[1]]
  }
  trade$price_percentile = price_percentile
  trade
}

And here is the data

start <- as.numeric(as.POSIXct("2013-01-01"))
end <- as.numeric(as.POSIXct("2013-01-02"))
set.seed(1)
minutes <- 15
ticks <- 45000
times <- sort(as.POSIXct(runif(ticks, start, end), origin=as.POSIXct("1970-01-01")))
trade <- data.frame(
  time=as.character(times),
  prc=100 + rnorm(ticks, 0, 5),
  siz=sample(1:10, ticks, rep=T),
  time.pos=times,
  time.pos.2=as.numeric(times)
)

And finally, the original functions slightly modified, but basically the same:

t_15_index <- function(data_vector,index) {
  which(data_vector[index] - data_vector[1:index]<=minutes*60)
}
get_percentile <- function(trade) {
  len_d <- dim(trade)[1]    
  price_percentile = vector(length=len_d)  
  for(i in 1: len_d) {       
    t_15 = t_15_index(trade$time.pos,i)
    price_dist = ecdf(rep(trade$prc[t_15],trade$siz[t_15]))
    price_percentile[i] = price_dist(trade$prc[i])
  }
  trade$price_percentile = price_percentile
  trade
}

Also, ddply will definitely not help with this. I also don't think data.table will help much, because the main lookup here on the large vector is done with integer indices. It could help a little; I may check that out later.


EDIT: one more way you could probably get a fairly significant performance improvement is to update your t_15 vector one element at a time instead of re-selecting it wholesale every iteration. In other words, find the next time value, and drop the values that are no longer within 15 minutes from your t_15 vector. Doing this is fairly complex in your case because the 15-minute window varies in how many items it can contain. You would have to size your baseline t_15 vector to be large enough to contain most windows, and any time you encounter a larger window, expand the vector (with the assumption this won't happen often). I'm not going to do this here because dealing with the changing window size requires a fair bit of book-keeping, but to give you an idea, here is an example where we loop through a 1MM vector with a 2K window:

# Version that pulls the whole 2000 entries each time
vec <- runif(1e6)        # the 1MM vector we loop through
sub.vec <- numeric(2000)
system.time(r1 <- for(i in seq_len(length(vec) - 2000)) sub.vec <- vec[i:(i+1999)])
#  user  system elapsed 
# 17.507   4.723  22.211 

# Version that overwrites 1 value at a time keeping the most recent 2K
sub.vec <- numeric(2001) # need to make this slightly larger because of 2000 %% 2000 == 0
system.time(r2 <- for(i in seq_len(length(vec) - 2000)) sub.vec[[(i %% 2000) + 1]] <- vec[[i]])

#  user  system elapsed 
# 2.642   0.009   2.650 

all.equal(r1, tail(r2, -1L))
# [1] TRUE

Finally, if you do it this way, you may also be able to come up with clever ways to re-calculate the percentiles very quickly, given that you know what your 15-minute window contained, what was added, and what was removed.

I'm not 100% sure whether the bookkeeping required for the FIFO 15-minute window would end up eating the benefits, though.
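
To make that idea concrete, here is a minimal, hypothetical sketch of the bookkeeping: keep size-weighted counts per price level, add entering trades, drop expiring ones, and read the percentile straight off the counts. The helper names are made up and prices are assumed to sit on discrete tick levels:

# Hypothetical sketch of incremental percentile bookkeeping (not part of
# the benchmarked code above). Assumes prices fall on discrete tick levels
# so they can key a count vector.
counts <- integer(0)   # size-weighted count of trades per price level

add_trade <- function(counts, prc, siz) {
  key <- format(prc, nsmall = 4)
  counts[key] <- if (key %in% names(counts)) counts[key] + siz else siz
  counts
}

drop_trade <- function(counts, prc, siz) {
  key <- format(prc, nsmall = 4)
  counts[key] <- counts[key] - siz
  counts
}

percentile_of <- function(counts, prc) {
  lvls <- as.numeric(names(counts))
  sum(counts[lvls <= prc]) / sum(counts)
}

# Example: two trades enter the window, then query a percentile
counts <- add_trade(counts, 121.3672, 1)
counts <- add_trade(counts, 121.3750, 4)
percentile_of(counts, 121.3672)  # 1 / 5 = 0.2

Updating the counts is cheap per trade; the cost moves to percentile_of, which still scans the distinct price levels, but for tick data that set is usually much smaller than the number of trades in the window.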

BrodieG
  • I suspect you might be able to do better if you calculated all `t_15` values in one sweep - assuming it's sorted you should only need to do a single pass through the data – hadley Jan 14 '14 at 13:38
  • @hadley, do you mean storing all the possible rolling 15 minute intervals (I'm estimating there are ~20MM of them)? And only then operating on them? Right now both the original approach and my adjustments sweep through the data once. – BrodieG Jan 14 '14 at 13:50
  • See my answer. You're sweeping through multiple times in `t_15_index` (although it's not the whole vector it's still fairly inefficient) – hadley Jan 14 '14 at 13:55

Here's a quick stab at more efficiently finding the index of the time that occurred fifteen minutes ago:

# Create some sample data (modified from BrodieG)
set.seed(1)

ticks <- 45000
start <- as.numeric(as.POSIXct("2013-01-01"))
end <- as.numeric(as.POSIXct("2013-01-02"))
times <- as.POSIXct(runif(ticks, start, end), origin=as.POSIXct("1970-01-01"))
trade <- data.frame(
  time = sort(times),
  prc = 100 + rnorm(ticks, 0, 5),
  siz = sample(1:10, ticks, rep = T)
)

# For vector of times, find the index of the first time that was at least
# fifteen minutes before the current time. Assumes times are sorted
minutes_ago <- function(time, minutes = 15) {
  secs <- minutes * 60
  time <- as.numeric(time)
  out <- integer(length(time))

  before <- 1

  for(i in seq_along(out)) {
    while(time[before] < time[i] - secs) {
      before <- before + 1
    }
    out[i] <- before

  }
  out
}
system.time(minutes_ago(trade$time))
# Takes about 0.2s on my machine

library(Rcpp)
cppFunction("IntegerVector minutes_ago2(NumericVector time, int minutes = 15) {
  int secs = minutes * 60;
  int n = time.size();
  IntegerVector out(n);

  int before = 0;
  for (int i = 0; i < n; ++i) {
    // Could do even better here with a binary search
    while(time[before] < time[i] - secs) {
      before++;
    }
    out[i] = before + 1;
  }
  return out;
}")

system.time(minutes_ago2(trade$time, 10))
# Takes less than 0.001s

all.equal(minutes_ago(trade$time, 15), minutes_ago2(trade$time, 15))
hadley
  • I see what you mean. The t_15 index construction step takes about 1.3 seconds (just running `t_15_index2`), so this saves about a second in the non-Rcpp version. Unfortunately that's small compared to the total run time of 17 seconds (this is relative to the machine I'm on now, not the same one I wrote the original code on). + 1 for the education. – BrodieG Jan 14 '14 at 14:17
  • @BrodieG I think for really fast code you could use the same strategy to compute a rolling ecdf, but that would be quite a lot more complex – hadley Jan 14 '14 at 14:38

If you want an ecdf-style percentile within dplyr, use seq_along/sapply inside mutate to get faster results:

y <- x %>% mutate(percentile.score = sapply(seq_along(score), function(i){sum(score[1:i] <= score[i])/i}))
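
For context, `x` here is whatever data frame holds the numeric score column; a toy setup (purely illustrative, not from the question's data) to make the snippet runnable:

library(dplyr)

# Toy stand-in for `x`: any data frame with a numeric `score` column
x <- data.frame(score = c(10, 20, 15, 30, 25))

y <- x %>%
  mutate(percentile.score = sapply(seq_along(score),
                                   function(i) sum(score[1:i] <= score[i]) / i))
y
#   score percentile.score
# 1    10        1.0000000
# 2    20        1.0000000
# 3    15        0.6666667
# 4    30        1.0000000
# 5    25        0.8000000

Note that this computes an expanding percentile from the start of the data rather than the question's rolling 15-minute window.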
ishonest