So I have time and sales data with 1 billion rows like this:
datetime price
"2016-05-01 18:00:02" 2060.75
"2016-05-01 18:00:22" 2060.50
"2016-05-01 18:00:35" 2060.50
"2016-05-01 18:01:59" 2060.75
"2016-05-01 18:03:21" 2061.00
"2016-05-01 18:03:21" 2061.25
"2016-05-01 18:03:42" 2061.00
"2016-05-01 18:04:22" 2061.00
"2016-05-01 18:04:25" 2061.25
"2016-05-01 18:04:44" 2061.50
"2016-05-01 18:06:41" 2061.50
and I have a function that at each minute interval will give the most recent price:
datetime price
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:02:00" 2060.75
"2016-05-01 18:03:00" 2060.75
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:05:00" 2061.50
"2016-05-01 18:06:00" 2061.50
"2016-05-01 18:07:00" 2061.50
My function rounds the times up to the nearest minute:
datetime price
"2016-05-01 18:01:00" 2060.75
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:02:00" 2060.75
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:04:00" 2061.25
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:05:00" 2061.00
"2016-05-01 18:05:00" 2061.25
"2016-05-01 18:05:00" 2061.50
"2016-05-01 18:07:00" 2061.50
and then starting from the bottom and moving up, removes the rows with duplicate times:
datetime price
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:02:00" 2060.75
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:05:00" 2061.50
"2016-05-01 18:07:00" 2061.50
and then add in the minutes that are missing:
datetime price
"2016-05-01 18:01:00" 2060.50
"2016-05-01 18:02:00" 2060.75
"2016-05-01 18:03:00" 2060.75
"2016-05-01 18:04:00" 2061.00
"2016-05-01 18:05:00" 2061.50
"2016-05-01 18:06:00" 2061.50
"2016-05-01 18:07:00" 2061.50
I'v'e tried many different functions but this was the quickest way I could find and function still works slow and I think there must be a more efficient way of doing this that I can't think of. Can anyone help?