
So I have time and sales data with 1 billion rows like this:

datetime               price 
"2016-05-01 18:00:02"  2060.75
"2016-05-01 18:00:22"  2060.50
"2016-05-01 18:00:35"  2060.50
"2016-05-01 18:01:59"  2060.75
"2016-05-01 18:03:21"  2061.00
"2016-05-01 18:03:21"  2061.25
"2016-05-01 18:03:42"  2061.00
"2016-05-01 18:04:22"  2061.00
"2016-05-01 18:04:25"  2061.25
"2016-05-01 18:04:44"  2061.50
"2016-05-01 18:06:41"  2061.50

and I have a function that, at each minute interval, gives the most recent price:

datetime               price 
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:03:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:06:00"  2061.50
"2016-05-01 18:07:00"  2061.50

My function rounds the times up to the next whole minute:

datetime               price 
"2016-05-01 18:01:00"  2060.75
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:04:00"  2061.25
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.00
"2016-05-01 18:05:00"  2061.25
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:07:00"  2061.50

and then starting from the bottom and moving up, removes the rows with duplicate times:

datetime               price 
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:07:00"  2061.50

and then adds in the minutes that are missing, repeating the previous price:

datetime               price 
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:03:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:06:00"  2061.50
"2016-05-01 18:07:00"  2061.50

I've tried many different functions, but this was the quickest way I could find and it is still slow. I think there must be a more efficient way of doing this that I can't think of. Can anyone help?

AndrewK
1 Answer


You can use rolling joins from `library(data.table)` to do this in two steps:

Create a data.table of all the 'minutes' of interest

dt_minutes <- data.table(datetime = seq(as.POSIXct("2016-05-01 18:00:00"),
                                        length.out = 10,
                                        by = "mins"))
dt_minutes

#                datetime
# 1: 2016-05-01 18:00:00
# 2: 2016-05-01 18:01:00
# 3: 2016-05-01 18:02:00
# 4: 2016-05-01 18:03:00
# 5: 2016-05-01 18:04:00
# 6: 2016-05-01 18:05:00
# 7: 2016-05-01 18:06:00
# 8: 2016-05-01 18:07:00
# 9: 2016-05-01 18:08:00
# 10: 2016-05-01 18:09:00
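
With a billion rows you won't want to hard-code the sequence; here's a sketch of building the minute grid from the range of the data itself (assuming the datetime column is already POSIXct):

rng  <- range(dt$datetime)
from <- as.POSIXct(trunc(rng[1], units = "mins"))
to   <- as.POSIXct(trunc(rng[2], units = "mins")) + 60  ## round the end up to the next minute
dt_minutes <- data.table(datetime = seq(from, to, by = "mins"))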

And use a rolling join to get the latest price for each minute

## you'll need to set your data to a data.table
# library(data.table)
# setDT(dt)
dt[dt_minutes, roll = TRUE, on = "datetime"]

#               datetime   price
# 1: 2016-05-01 18:00:00      NA
# 2: 2016-05-01 18:01:00 2060.50
# 3: 2016-05-01 18:02:00 2060.75
# 4: 2016-05-01 18:03:00 2060.75
# 5: 2016-05-01 18:04:00 2061.00
# 6: 2016-05-01 18:05:00 2061.50
# 7: 2016-05-01 18:06:00 2061.50
# 8: 2016-05-01 18:07:00 2061.50
# 9: 2016-05-01 18:08:00 2061.50
# 10: 2016-05-01 18:09:00 2061.50
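
The first row is NA because 18:00:00 pre-dates the first trade, so you may want to drop such rows. And if straight forward-filling isn't quite what you want, roll accepts other values too (standard data.table options, sketched here):

res <- dt[dt_minutes, roll = TRUE, on = "datetime"]
res <- res[!is.na(price)]  ## drop minutes before the first trade

dt[dt_minutes, roll = 60,        on = "datetime"]  ## roll forward at most 60 seconds
dt[dt_minutes, roll = -Inf,      on = "datetime"]  ## roll backwards to the next price
dt[dt_minutes, roll = "nearest", on = "datetime"]  ## take the closest price on either side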

Data

library(data.table)

dt <- fread('datetime               price 
"2016-05-01 18:00:02"  2060.75
"2016-05-01 18:00:22"  2060.50
"2016-05-01 18:00:35"  2060.50
"2016-05-01 18:01:59"  2060.75
"2016-05-01 18:03:21"  2061.00
"2016-05-01 18:03:21"  2061.25
"2016-05-01 18:03:42"  2061.00
"2016-05-01 18:04:22"  2061.00
"2016-05-01 18:04:25"  2061.25
"2016-05-01 18:04:44"  2061.50
"2016-05-01 18:06:41"  2061.50', header = T)

Here's a good blog post on rolling joins to get you started.

SymbolixAU
  • This looks perfect and seems to work really quickly, but when I use a `dt` with 10421552 rows and a `dt_minutes` with 37380 minutes, the rolling join returns 52100 rows instead of 37380, with the occasional minute repeated 25 times. Any idea what is going wrong? – AndrewK Jun 30 '16 at 03:15
  • @AndrewK It's hard to diagnose without seeing all your data. This question and answer (http://stackoverflow.com/q/32007115/5977215) may give you some insights. – SymbolixAU Jun 30 '16 at 03:24
  • @AndrewK - can you reproduce your error with a smaller dataset, and post it as another question? – SymbolixAU Jun 30 '16 at 03:50
  • I worked out what I had done wrong. Thanks for your help. These rolling joins work fantastically. – AndrewK Jul 01 '16 at 01:21
  • @AndrewK - no problem; rolling joins are a great feature. And the next release of `data.table` will support non-equi joins too. – SymbolixAU Jul 01 '16 at 02:52
  • So I actually want the most recent price as of each minute (e.g. "2016-05-01 18:02:00") but not including that exact time. Do I need to change all the times to "2016-05-01 18:01:59.999", or is there a way to exclude the trades at "2016-05-01 18:02:00" itself? – AndrewK Jul 01 '16 at 02:58
  • @AndrewK - your suggestion sounds like it should work; worth testing it, I suppose. – SymbolixAU Jul 01 '16 at 03:09
  • I created a new question with the problem I was having. I thought I worked out the solution but I was mistaken. http://stackoverflow.com/questions/38138172/r-unexpected-output-for-rolling-join-with-data-table – AndrewK Jul 01 '16 at 06:09