R: Efficiently subsetting dataframe based on time of day

Question

I have a large (150,000x7) dataframe that I intend to use for back-testing and real-time analysis of a financial market. The data represents the condition of an investment vehicle at 5 minute intervals (although holes do exist). It looks like this (but much longer):

        pTime     Time  Price       M1       M2        M3        M4
1  1212108300 20:45:00 1.5518 12.21849 -0.37125   4.50549 -31.00559
2  1212108900 20:55:00 1.5516 11.75350 -0.81792  -1.53846 -32.12291
3  1212109200 21:00:00 1.5512 10.75070 -1.47438  -8.24176 -34.35754
4  1212109500 21:05:00 1.5514 10.23529 -1.06044  -8.46154 -33.24022
5  1212109800 21:10:00 1.5514  9.74790 -1.02759 -10.21978 -33.24022
6  1212110100 21:15:00 1.5513  9.31092 -1.17076 -11.97802 -33.79888
7  1212110400 21:20:00 1.5512  8.84034 -1.28428 -13.62637 -34.35754
8  1212110700 21:25:00 1.5509  8.07843 -1.63715 -18.24176 -36.03352
9  1212111000 21:30:00 1.5509  7.39496 -1.49198 -20.65934 -36.03352
10 1212111300 21:35:00 1.5512  7.65266 -1.03717 -18.57143 -34.35754

The data is pre-loaded into R, but during my back-test I need to subset it by two criteria:

The first criteria is a sliding window to avoid peeking into the future. The window must be such that, each new 5 minute interval on the back-test shifts the whole window into the future by 5 minutes. This part I can do like this:

require(zoo)
zooser <- zoo(x=tser$Close, order.by=as.POSIXct(tser$pTime, origin="1970-01-01"))
window(zooser, start=A, end=B)

The second criteria is another sliding window, but one that slides through time of day and contains only those entries that are within N minutes of the input time on any given day.

Example: If the window's size is 2 hours, and the input time is 12:00PM then the window must contain all rows with Time between 10:00AM and 2:00PM

This is the part that I am having trouble figuring out.

Edit: My data has holes in it, two consecutive rows could be MORE than 5 minutes apart. The data looks like this (very zoomed in) enter image description here

As the window moves through these gaps the number of points inside the windows should vary.

The following is my MySQL code that does what I want to do in R (same table structure):

SET @qTime = Time(FROM_UNIXTIME(SAMP_endTime));

SET @inc = -1;
INSERT INTO MetIndListBuys (pTime,ArrayPos,M1,M2,M3,M4)
SELECT pTime,@inc:=@inc+1,M1,M2,M3,M4
 FROM mergebuys USE INDEX (`y`) WHERE  pTime BETWEEN SAMP_startTime AND SAMP_endTime 
AND TIME_TO_SEC(TIMEDIFF(Time,@qTime))/3600 BETWEEN 0-HourSpan AND HourSpan
;

Those index values are nowhere near 12:00PM, at least in my timezone anyway. — IRTFM, Dec 18 '11 at 21:47
Yeah, you are right... I just randomly chose `12:00PM` Bad example I guess. — Mike Furlender, Dec 18 '11 at 23:13

score 3 · Answer 1 · edited May 23 '17 at 12:01

3

1) If DF is the data frame shown in the question then create a zoo object from it as you have done and split it into days giving zs. Then lapply your function f to each successive set of w points in each component (i.e. in each day). For example, if you want to apply your function to 2 hours of data at a time and your data is regularly spaced 5 minute data then w = 24 (since there are 24 five minute periods in two hours). In such a case f would be passed 24 rows of data as a matrix each time its called. Also align has been set to "right" below but it can alternately be set to align="center" and the condition giving ix can be changed to double sided, etc. For more on rollapply see: ?rollapply

library(zoo)
z <- zoo(DF[-2], as.POSIXct(DF[,1], origin = "1970-01-01"))
w <- 3 # replace this with 24 to handle two hours at a time with five min data
f <- function(x) {
            tt <- x[, 1]
            ix <- tt[w] - tt <= w * 5 * 60 # RHS converts w to seconds
            x <- x[ix, -1]
            sum(x) # replace sum with your function
    }
out <- rollapply(z, w, f, by.column = FALSE, align = "right")

Using the data frame in the question we get this:

> out
$`2008-05-30`
2008-05-30 02:00:00 2008-05-30 02:05:00 2008-05-30 02:10:00 2008-05-30 02:15:00 
          -66.04703           -83.92148           -95.93558          -100.24924 
2008-05-30 02:20:00 2008-05-30 02:25:00 2008-05-30 02:30:00 2008-05-30 02:35:00 
         -108.15038          -121.24519          -134.39873          -140.28436

By the way, be sure to read this post .

2) This could alternately be done as the following where w and f are as above:

n <- nrow(DF)
m <- as.matrix(DF[-2])
sapply(w:n, function(i) { m <- m[seq(length = w, to = i), ]; f(m) })

Replace the sapply with lapply if needed. Also this may seem shorter than the first solution but its not much different once you add the code to define f and w (which appear in the first but not the second).

If there are no holes during the day and only holes between days then these solutions could be simplified.

edited May 23 '17 at 12:01

Community

1
1

answered Dec 18 '11 at 21:43

G. Grothendieck

254,981
17
203
341

Thanks for your answer (and the link!) The problem is that my data is not regularly spaced - there can be (and should be) holes in it. I apologize for not making that clear initially. I edited my post to represent that. – Mike Furlender Dec 18 '11 at 23:47
Are the holes just the times between days or are there holes in the days too? From the graph it seems that the holes are only between days and if that is the case then the code above should work. – G. Grothendieck Dec 19 '11 at 01:00
Also I have modified the solution so that it works even if there are holes during the day. – G. Grothendieck Dec 19 '11 at 01:19
The holes are in the days as well. Thanks a ton for your answer, I am new to R so I am going to have to mull it over tomorrow. – Mike Furlender Dec 19 '11 at 05:25
I have made a couple of simplifications. Since the holes can be anywhere there is no point in breaking it into days since the extra code in `f` also handles that so the splitting has been removed. Also the `tail` in `f` has been simplified out. – G. Grothendieck Dec 19 '11 at 11:55
Can you please explain this: w <- 3 # replace this with 24 to handle two hours – Mike Furlender Dec 20 '11 at 00:43
Did you mean "replace this with 2" ? Why does `24` cause it to handle 2 hour differences? Sorry if this is a dumb question.. – Mike Furlender Dec 20 '11 at 00:44
2 hours / 5 minutes = 24 points so if there are no holes we want to look at exactly 24 points but if there are holes then the last 24 points will include some that are more than 2 hours back so check which points are more than 2 hours back, if any, and exclude those. – G. Grothendieck Dec 20 '11 at 01:35

David F · Accepted Answer · 2011-12-19T17:27:47.143

3

Say that you have your target time t0 on the same scale as pTime: seconds since epoch. Then t0 - pTime = (difference in the number of days since epoch between the two) + (difference in remaining seconds). Taking t0 - pTime %% (num. seconds per day) will leave us with the difference in seconds in clock arithmetic (wrapped around if the difference is negative). This suggests the following function:

SecondsPerDay <- 24 * 60 * 60
within <- function(d, t0Sec, wMin) {
  diff <- (d$pTime - t0Sec) %% SecondsPerDay
  wSec <- 60 * wMin
  return(d[diff < wSec | diff > (SecondsPerDay - wSec), ])
}

edited Dec 19 '11 at 17:27

answered Dec 19 '11 at 10:06

David F

1,255
9
12

I am trying to understand this solution but I am new to R - can you please explain what is going on in the `return` statement? – Mike Furlender Dec 20 '11 at 05:06
For a vector of booleans x, d[x, ] returns a new data.frame with just the rows of d for which x has TRUE. "|" is the vector "OR" operator with the ith entry in its result TRUE if the ith entry of at least one of its arguments was TRUE. For example, the ith entry in "diff < wSec | diff > (SecondsPerDay - wSec)" will be TRUE if "diff[i] < wSec" OR "diff[i] > (SecondsPerDay - wSec)". (Since wSec has length one, which is less than the length of diff, R will "recycle" that one value for each value in diff when carrying out the "<" operation, same for (SecondsPerDay - wSec).) Hope this helps. – David F Dec 20 '11 at 08:28
And specifically, diff > (SecondsPerDay - wSec) handles cases where the diff was actually "negative" but got wrapped around by the clock arithmetic. – David F Dec 20 '11 at 08:33
Got it! Fast,simple, and intuitive. Thanks a lot! – Mike Furlender Dec 20 '11 at 23:11

R: Efficiently subsetting dataframe based on time of day

2 Answers2

Linked