2

I need to do previous tick aggregation on my tick data set for 5 minute intervals. Please note what I want to do is analogous to aggregateTrades() function in highfrequency package. But I need to solve this problem without using highfrequency package due to some other data handling issues. This is my data set:

dput(tt)
structure(c(1371.25, NA, 1373.95, NA, NA, 1373, NA, 1373.95, 
1373.9, NA, NA, 1374, 1374.15, NA, 1374, 1373.85, 1372.55, 1374.05, 
1374.15, 1374.75, NA, NA, 1375.9, 1374.05, NA, NA, NA, NA, NA, 
NA, NA, 1375, NA, NA, NA, NA, NA, 1376.35, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 1376.25, NA, 1378, 1376.5, NA, NA, NA, 1378, 
1378, NA, NA, 1378.8, 231.9, 231.85, NA, 231.9, 231.85, 231.9, 
231.8, 231.9, 232.6, 231.95, 232.35, 232, 232.1, 232.05, 232.05, 
232.05, 231.5, 231.3, NA, NA, 231.1, 231.1, 231.1, 231, 231, 
230.95, 230.6, 230.6, 230.7, 230.6, 231, NA, 231, 231, 231.45, 
231.65, 231.4, 231.7, 231.3, 231.25, 231.25, 231.4, 231.4, 231.85, 
231.75, 231.5, 231.55, 231.35, NA, 231.5, 231.5, NA, 231.5, 231.25, 
231.15, 231, 231, 231, 231.05, NA), .indexCLASS = c("POSIXct", 
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "Asia/Calcutta", tzone = "Asia/Calcutta", index = structure(c(1459481850, 
1459482301, 1459482302, 1459482303, 1459482304, 1459482305, 1459482306, 
1459482307, 1459482309, 1459482310, 1459482311, 1459482312, 1459482314, 
1459482315, 1459482316, 1459482317, 1459482318, 1459482319, 1459482320, 
1459482321, 1459482322, 1459482323, 1459482324, 1459482326, 1459482328, 
1459482329, 1459482330, 1459482331, 1459482332, 1459482336, 1459482337, 
1459482338, 1459482339, 1459482342, 1459482344, 1459482346, 1459482347, 
1459482348, 1459482349, 1459482350, 1459482351, 1459482354, 1459482355, 
1459482356, 1459482357, 1459482358, 1459482359, 1459482362, 1459482363, 
1459482364, 1459482369, 1459482370, 1459482371, 1459482372, 1459482373, 
1459482378, 1459482379, 1459482380, 1459482382, 1459482388), tzone = "Asia/Calcutta", tclass = c("POSIXct", 
"POSIXt")), .Dim = c(60L, 2L), .Dimnames = list(NULL, c("A", 
"B")), class = c("xts", "zoo"))

This is my code for previous tick aggregation:

ag.5min.tt<-tt%>%filter(as.Date(index(tt)))%>%lapply(aggregate(by=cut(format(index(tt), format = "%H:%M:%S"), breaks = "5 mins", Fun=tail)))

What I am trying to do with the above code is to make 5 minute intervals each day for the prices of A and B. But I am getting error. Please suggest how to fix this error:

Error in UseMethod("filter_") : 
  no applicable method for 'filter_' applied to an object of class "c('xts', 'zoo')" 

Thanks.

Edit: Converting the xts object to dataframe:

tt<-as.data.frame(tt)
tt<-data.frame(Time=rownames(tt), coredata(tt))
ag.5min.tt<-tt%>% filter(as.Date(index(tt)))%>%lapply(aggregate(by=cut(format(index(tt), format = "%H:%M:%S"), breaks = "5 mins", Fun=tail)))

New error:

Error in eval(substitute(expr), envir, enclos) : 
  filter condition does not evaluate to a logical vector. 

Edit: attempt:

tt$Time<- as.POSIXct(tt$Time, format="%Y-%m-%d %H:%M:%S")
ag.5min.tt<-tt%>% group_by(Time==as.Date(tt$Time))%>%lapply(aggregate(by=cut(format(tt$Time, format = "%H:%M:%S"), breaks = "5 mins", Fun=tail)))

Error:

Error in cut.default(format(tt$Time, format = "%H:%M:%S"), breaks = "5 mins",  : 
  'x' must be numeric
In addition: Warning message:
In eval(substitute(expr), envir, enclos) :
  Incompatible methods ("Ops.POSIXt", "Ops.Date") for "=="

The result would look like this. Every five minute time stamp will have values for that particular time stamp or if there is an NA that time stamp will have last non- NA value for the stocks A and B

  time                  A      B
1 2016-04-01 09:00:00      NA    NA
2 2016-04-01 09:05:00      NA    NA
3 2016-04-01 09:10:00      NA    NA
4 2016-04-01 09:15:00 1371.25 231.90
5 2016-04-01 09:20:00 1376.35 231.55
runjumpfly
  • 319
  • 1
  • 10
  • 1
    You are applying the dplyr methods to an xts object – akrun Oct 03 '16 at 12:29
  • @akrun I have updated my question as per your suggestion. Please review – runjumpfly Oct 03 '16 at 12:42
  • Now, if you look at the `filter` statement there is no condition. i.e `filter(as.Date(Time))` – akrun Oct 03 '16 at 12:45
  • What should I do here if I want to group rows by days in my Time column. Sadly my Time (POSIXct) now became Factor due to dataframe conversion. – runjumpfly Oct 03 '16 at 12:49
  • In that case just do `group_by(Date = as.Date(Time))` – akrun Oct 03 '16 at 12:55
  • Why is your aggregation function `tail`? What is your expected output? – aichao Oct 03 '16 at 13:21
  • @aichao I want to pick row every 5 mins. That is if we begin at 9.00 then 9.00, 9.05, 9.10 and so on – runjumpfly Oct 03 '16 at 13:23
  • OK, so you want `:05`, `:10`, etc. no matter what the hour is? Also, what if there are multiple measurements (in seconds) between `:05` and `:06`? Do you want the first? – aichao Oct 03 '16 at 13:39
  • Yes. So from 9.00 to 16.00, there will be 85 five minute rows. The row that is previously closest to the interval will be picked. That is, out of 9.03.4, 9.04.32, 9.04.54, the previous closest 9.04.54 will be picked for the interval 9.00 to 9.05. – runjumpfly Oct 03 '16 at 13:46
  • Is this a school assignment? – Paul Rougieux Oct 05 '16 at 07:23
  • Sorry for asking but the example you have has such a nice combination of issues (missing observation over one interval for b, missing observation over a full interval 9h05-9h10 for both variables) that I thought it was conceived as a teaching exercise. – Paul Rougieux Oct 05 '16 at 08:04

2 Answers2

2

You can use .indexmin to index your time-series by minutes and then manipulate that index to subset observations:

ind <- which(diff(.indexmin(tt) %% 5) == -4)
res <- tt[ind]

Here, .indexmin(tt) %% 5 will return the number of minutes since the last fifth minute. For our purposes, we want to extract from this the last index from each series of 4s, which is the last observation for the minute preceding each fifth minute. To do that we can use diff and just extract the the index for which there is a cross over from 4 to 0 (resulting in a diff of -4) using which.

To illustrate, we modify your posted data to add observations that actually satisfy your extraction condition:

tt <- structure(c(1371.25, NA, 1373.95, NA, NA, 1373, NA, 1373.95, 
            1373.9, NA, NA, 1374, 1374.15, NA, 1374, 1373.85, 1372.55, 1374.05, 
            1374.15, 1374.75, NA, NA, 1375.9, 1374.05, NA, NA, NA, NA, NA, 
            NA, NA, 1375, NA, NA, NA, NA, NA, 1376.35, NA, NA, NA, NA, NA, 
            NA, NA, NA, NA, NA, 1376.25, NA, 1378, 1376.5, NA, NA, NA, 1378, 
            1378, NA, NA, 1378.8, 231.9, 231.85, NA, 231.9, 231.85, 231.9, 
            231.8, 231.9, 232.6, 231.95, 232.35, 232, 232.1, 232.05, 232.05, 
            232.05, 231.5, 231.3, NA, NA, 231.1, 231.1, 231.1, 231, 231, 
            230.95, 230.6, 230.6, 230.7, 230.6, 231, NA, 231, 231, 231.45, 
            231.65, 231.4, 231.7, 231.3, 231.25, 231.25, 231.4, 231.4, 231.85, 
            231.75, 231.5, 231.55, 231.35, NA, 231.5, 231.5, NA, 231.5, 231.25, 
            231.15, 231, 231, 231, 231.05, NA), .indexCLASS = c("POSIXct", 
                                                                "POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "Asia/Calcutta", tzone = "Asia/Calcutta", index = structure(c(1459482299, 
                                                                                                                                                                                     1459482301, 1459482302, 1459482303, 1459482304, 1459482305, 1459482306, 
                                                                                                                                                                                     1459482307, 1459482309, 1459482310, 1459482311, 1459482312, 1459482314, 
                                                                                                                                                                                     1459482315, 1459482316, 1459482317, 1459482318, 1459482319, 1459482320, 
                                                                                                                                                                                     1459482321, 1459482322, 1459482323, 1459482324, 1459482326, 1459482328, 
                                                                                                                                                                                     1459482329, 1459482330, 1459482331, 1459482332, 1459482336, 1459482337, 
                                                                                                                                                                                     1459482338, 1459482339, 1459482342, 1459482344, 1459482346, 1459482347, 
                                                                                                                                                                                     1459482348, 1459482349, 1459482590, 1459482591, 1459482594, 1459482595, 
                                                                                                                                                                                     1459482596, 1459482597, 1459482598, 1459482599, 1459482602, 1459482603, 
                                                                                                                                                                                     1459482604, 1459482609, 1459482610, 1459482611, 1459482612, 1459482613, 
                                                                                                                                                                                     1459482618, 1459482619, 1459482620, 1459482622, 1459482628), tzone = "Asia/Calcutta", tclass = c("POSIXct", 
                                                                                                                                                                                                                                                                                      "POSIXt")), .Dim = c(60L, 2L), .Dimnames = list(NULL,c("A", 
                                                                                                                                                                                                                                                                                                                                              "B")), class = c("xts", "zoo"))
##                          A      B
##2016-04-01 09:14:59 1371.25 231.90
##2016-04-01 09:15:01      NA 231.85
##2016-04-01 09:15:02 1373.95     NA
##2016-04-01 09:15:03      NA 231.90
##2016-04-01 09:15:04      NA 231.85
##2016-04-01 09:15:05 1373.00 231.90
##2016-04-01 09:15:06      NA 231.80
##2016-04-01 09:15:07 1373.95 231.90
##2016-04-01 09:15:09 1373.90 232.60
##2016-04-01 09:15:10      NA 231.95
##2016-04-01 09:15:11      NA 232.35
##2016-04-01 09:15:12 1374.00 232.00
##2016-04-01 09:15:14 1374.15 232.10
##2016-04-01 09:15:15      NA 232.05
##2016-04-01 09:15:16 1374.00 232.05
##2016-04-01 09:15:17 1373.85 232.05
##2016-04-01 09:15:18 1372.55 231.50
##2016-04-01 09:15:19 1374.05 231.30
##2016-04-01 09:15:20 1374.15     NA
##2016-04-01 09:15:21 1374.75     NA
##2016-04-01 09:15:22      NA 231.10
##2016-04-01 09:15:23      NA 231.10
##2016-04-01 09:15:24 1375.90 231.10
##2016-04-01 09:15:26 1374.05 231.00
##2016-04-01 09:15:28      NA 231.00
##2016-04-01 09:15:29      NA 230.95
##2016-04-01 09:15:30      NA 230.60
##2016-04-01 09:15:31      NA 230.60
##2016-04-01 09:15:32      NA 230.70
##2016-04-01 09:15:36      NA 230.60
##2016-04-01 09:15:37      NA 231.00
##2016-04-01 09:15:38 1375.00     NA
##2016-04-01 09:15:39      NA 231.00
##2016-04-01 09:15:42      NA 231.00
##2016-04-01 09:15:44      NA 231.45
##2016-04-01 09:15:46      NA 231.65
##2016-04-01 09:15:47      NA 231.40
##2016-04-01 09:15:48 1376.35 231.70
##2016-04-01 09:15:49      NA 231.30
##2016-04-01 09:19:50      NA 231.25
##2016-04-01 09:19:51      NA 231.25
##2016-04-01 09:19:54      NA 231.40
##2016-04-01 09:19:55      NA 231.40
##2016-04-01 09:19:56      NA 231.85
##2016-04-01 09:19:57      NA 231.75
##2016-04-01 09:19:58      NA 231.50
##2016-04-01 09:19:59      NA 231.55
##2016-04-01 09:20:02      NA 231.35
##2016-04-01 09:20:03 1376.25     NA
##2016-04-01 09:20:04      NA 231.50
##2016-04-01 09:20:09 1378.00 231.50
##2016-04-01 09:20:10 1376.50     NA
##2016-04-01 09:20:11      NA 231.50
##2016-04-01 09:20:12      NA 231.25
##2016-04-01 09:20:13      NA 231.15
##2016-04-01 09:20:18 1378.00 231.00
##2016-04-01 09:20:19 1378.00 231.00
##2016-04-01 09:20:20      NA 231.00
##2016-04-01 09:20:22      NA 231.05
##2016-04-01 09:20:28 1378.80     NA

With this data, we get:

print(res)
##                          A      B
##2016-04-01 09:14:59 1371.25 231.90
##2016-04-01 09:19:59      NA 231.55

To get the output you posted, you would need first generate a time series that have data (set to NA) for every 5 minute ticks that you want. For this example, this time series (only for 5 minute ticks from 09:00 to 09:20 on 2016-04-01) can be:

every.5.min <- structure(c(NA, NA, NA, NA, NA), .Dim = c(5L, 1L), .Dimnames = list(
NULL, "Empty"), index = structure(c(1459481400, 1459481700, 
1459482000, 1459482300, 1459482600), tzone = "Asia/Calcutta", tclass = c("POSIXct", 
"POSIXt")), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", 
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "Asia/Calcutta", tzone = "Asia/Calcutta")
##                    Empty
##2016-04-01 09:00:00    NA
##2016-04-01 09:05:00    NA
##2016-04-01 09:10:00    NA
##2016-04-01 09:15:00    NA
##2016-04-01 09:20:00    NA

Then, merge this with tt:

tt <- merge(tt, every.5.min, all=TRUE)[,1:ncol(tt)]

The all=TRUE will fill rows in the original tt with NA if that row (i.e., every 5 min) does not exist in tt. Note after the merge, we only keep the columns from the original tt.

Then, over tt, fill all NA with prior values as you did:

res <- do.call(merge, lapply(tt, na.locf))

Finally, extract only those rows for each 5 min tick using .indexmin and .indexsec:

res <- res[.indexmin(res) %% 5 == 0 & .indexsec(res) == 0]
##                          A      B
##2016-04-01 09:00:00      NA     NA
##2016-04-01 09:05:00      NA     NA
##2016-04-01 09:10:00      NA     NA
##2016-04-01 09:15:00 1371.25 231.90
##2016-04-01 09:20:00 1376.35 231.55
aichao
  • 7,375
  • 3
  • 16
  • 18
  • this is very interesting. It would be nice if we can replace each NA with the most recent non-NA for A and B starting from 9.00, 9.05, 9.10.. and so on. or as with this data: 9.15, 9.20, ... – runjumpfly Oct 04 '16 at 14:12
  • Sorry for late response. – runjumpfly Oct 04 '16 at 14:25
  • @runjumpfly: sorry for the delayed on my end. Work has been busy lately. Please see if my edit suffices. – aichao Oct 05 '16 at 12:29
  • Error in mapply(function(b, e) { : object 'ind' not found . What does be and e mean? – runjumpfly Oct 07 '16 at 10:14
  • aichao thank you for the help.Please see the table I inserted and let me know if I have been able to illustrate what I need. – runjumpfly Oct 07 '16 at 11:21
  • @runjumpfly: `b` and `e` are the begin and end indices for each 5 minute interval. The error you got is because you need to execute the `mapply` command **after** finding `ind` using `ind <- which(diff(.indexmin(tt) %% 5) == -4)`, which is part of the original answer. – aichao Oct 07 '16 at 12:04
  • aichao, can we first fill all the NAs with prior values in A and B using tt1<-lapply(tt, na.locf) then use ind <- which(diff(.indexmin(tt1) %% 5) == -4) res <- tt1[ind] but tt1 becomes a list instead of xts – runjumpfly Oct 07 '16 at 12:55
  • @runjumpfly: You can, but what are those prior values? The `ind` gives your the row indices (i.e., time) of your time series that are the last observation before each 5 minute mark. I thought that these are the "prior" values (assumed not NA) that you want to use to fill those NAs in the following 5 minute interval. That is what the `mapply` command does. – aichao Oct 07 '16 at 13:04
  • I did it with tt1<-do.call(merge, lapply(tt, na.locf)), and then used first part of your answer, and I get this dput(res) structure(c(1371.25, 1376.35, 231.9, 231.55), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", "POSIXt"), .indexTZ = "Asia/Calcutta", tclass = c("POSIXct", "POSIXt"), tzone = "Asia/Calcutta", index = structure(c(1459482299, 1459482599), tzone = "Asia/Calcutta", tclass = c("POSIXct", "POSIXt" )), .Dim = c(2L, 2L), .Dimnames = list(NULL, c("A", "B"))) – runjumpfly Oct 07 '16 at 13:12
  • Can we make it look like the table at the end of my question? please – runjumpfly Oct 07 '16 at 13:14
  • @runjumpfly: I believe I finally understand what you are saying. Please see my edit. – aichao Oct 07 '16 at 14:28
  • aichao the result is perfect, however, the making the NA dataset every.5.min will give me trouble as I have many stocks(A, B, C, ...) for many days. – runjumpfly Oct 07 '16 at 14:45
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/125185/discussion-between-aichao-and-runjumpfly). – aichao Oct 07 '16 at 14:47
  • thank you very much for helping me. I have figured it out due to your suggestions related to manipulating xts objects in the answer and so I am accepting it. – runjumpfly Oct 12 '16 at 13:19
  • I need to chat with you. – runjumpfly Oct 29 '16 at 08:51
1

Change the xts object into a data frame to work with dplyr

library(dplyr)
library(tibble)
library(xts)
library(tidyr)
dtf <- tt %>% 
    as.data.frame() %>%
    # add time information
    rownames_to_column("time") %>%
    mutate(time = as.POSIXct(time))

Generate the vector of times to pick every 5 minutes (300 seconds) between min and max time

timepick <- seq(trunc(min(dtf$time),"hour"), # start at the hour
                max(dtf$time)+300 , 300)

Use the vector of breaks to pick the last available observation in each 5 minutes time interval.

ag.5min.tt <- dtf %>%
    # Add missing interval
    full_join(data_frame(time = timepick), by = "time") %>%
    arrange(time) %>% # important to arrange by time here
    # Replace each NA with the most recent non-NA
    fill(-time) %>% 
    # take selected values only
    filter(time %in% timepick) 

Convert back to an xts object

ag.5min.tt <- ag.5min.tt %>% 
    as.data.frame() %>% 
    column_to_rownames("time") %>% 
    as.xts()
ag.5min.tt

                          A      B
2016-04-01 09:00:00      NA     NA
2016-04-01 09:05:00      NA     NA
2016-04-01 09:10:00 1371.25 231.90
2016-04-01 09:15:00 1371.25 231.90
2016-04-01 09:20:00 1378.80 231.05
Paul Rougieux
  • 10,289
  • 4
  • 68
  • 110
  • Thanks. It seems the answer has 10 min intervals. Please elaborate. – runjumpfly Oct 04 '16 at 14:15
  • That's because there is no data in the 09:05 09:10 interval. You can merge with the timepick vector to add this interval, then fill data in. – Paul Rougieux Oct 05 '16 at 07:22
  • I updated the answer to include the 09:05 09:10 interval – Paul Rougieux Oct 05 '16 at 10:38
  • Thank you Paul. May be I have not been able to convey what result I need. Please see the table in the question and let me know Is my question clearer now. – runjumpfly Oct 07 '16 at 11:20
  • In essence you are looking at the end of the interval while I supplied the beginning of the 5 minutes interval. Your last table in the question says that at `09h15` previous A value = `1371.25` OK. But I don't understand why at `09h20` previous A value = `1376.35`. In the `tt` dataset you gave, previous A value should be equal to `1378.80`. I also don't understand why at `09h10` previous A value is `NA` since there is a `1371.25` value at `09h07`. – Paul Rougieux Oct 12 '16 at 06:53
  • 1
    I updated the answer to display the end of the interval. The use of cut() turns out not to be necessary. After merging and filling the data, I just filter() relevant observations present in the `timepick` vector. – Paul Rougieux Oct 12 '16 at 07:09
  • Thank you @Paul . Your suggestions helped me to know more about dplyr. But xts suggestions were more applicable to my problem. Thanks again. – runjumpfly Oct 12 '16 at 13:21