Let's say I have the following data.table:
library(data.table)

DT = structure(list(date = structure(c(17774, 16545, 15398, 17765,
17736, 16342, 15896, 17928, 16692, 18022), class = "Date"), exdate = structure(c(17809,
16549, 15605, 17781, 17746, 16361, 16060, 17977, 16724, 18033
), class = "Date"), price_at_entry = c(301.66, 205.27, 33.81,
321.64, 297.43, 245.26, 122.27, 312.21, 253.19, 255.34), strike_price = c(195,
212.5, 37, 255, 430, 120, 46, 320, 440, 245)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
setDT(DT)  # structure() drops data.table's internal self-reference; restore it
DT[, DTE := as.integer(difftime(exdate, date, units = "days"))]
date exdate price_at_entry strike_price DTE
1: 2018-08-31 2018-10-05 301.66 195.0 35
2: 2015-04-20 2015-04-24 205.27 212.5 4
3: 2012-02-28 2012-09-22 33.81 37.0 207
4: 2018-08-22 2018-09-07 321.64 255.0 16
5: 2018-07-24 2018-08-03 297.43 430.0 10
6: 2014-09-29 2014-10-18 245.26 120.0 19
7: 2013-07-10 2013-12-21 122.27 46.0 164
8: 2019-02-01 2019-03-22 312.21 320.0 49
9: 2015-09-14 2015-10-16 253.19 440.0 32
10: 2019-05-06 2019-05-17 255.34 245.0 11
I want to subset the data.table for rows whose DTE is within 10 units of various DTE_target values. My current solution is to use rbindlist and lapply to loop through the values of DTE_target. Something like this:
rbindlist(
  lapply(c(7, 30, 60, 90), function(DTE_target) {
    # for each date, keep the row closest to DTE_target, but only if within 10 days
    DT[DT[, .I[abs(DTE - DTE_target) == min(abs(DTE - DTE_target)) &
                 abs(DTE - DTE_target) < 10], by = date]$V1
       ][, DTE_target := DTE_target]
  })
)
date exdate price_at_entry strike_price DTE DTE_target
1: 2015-04-20 2015-04-24 205.27 212.5 4 7
2: 2018-08-22 2018-09-07 321.64 255.0 16 7
3: 2018-07-24 2018-08-03 297.43 430.0 10 7
4: 2019-05-06 2019-05-17 255.34 245.0 11 7
5: 2018-08-31 2018-10-05 301.66 195.0 35 30
6: 2015-09-14 2015-10-16 253.19 440.0 32 30
Is there a more data.table-like (and more efficient) solution? I need to apply this process to potentially billions of rows, so I am also open to a PostgreSQL solution. After obtaining the above result, I repeat a similar process using price_at_entry and strike_price, which in its current form introduces even more looping. (A sketch of folding that step into the same join is at the end of this post.)
Maybe it's possible to use rolling joins? I tried joining DT on itself using date and exdate as the keys with roll = 10, but I cannot seem to get a solution that makes sense. Any help would be appreciated. Thanks!
EDIT:
I can't believe I missed this... Here is a potential solution that I need to keep exploring, but it seems to be very efficient.
DTE_target = c(7, 14, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300)

# create a map of target DTEs with the +/- 10 range
# (DTE has to be duplicated as DTE_target because the rolling join
#  overwrites the join column DTE with the values coming from DT)
DTE_table = data.table(DTE = DTE_target,
                       DTE_low = DTE_target - 10,
                       DTE_high = DTE_target + 10,
                       DTE_target = DTE_target)

# roll each row of DT onto the nearest target
DTE_table[DT, on = .(DTE), roll = "nearest"]

# then keep only the matches inside the low/high range
DTE_table[DT, on = .(DTE), roll = "nearest"][DTE >= DTE_low & DTE <= DTE_high]
EDIT:
Based on @Henrik's comment, a non-equi update join does it in one step:
DT[DTE_table, on = .(DTE >= DTE_low, DTE <= DTE_high), DTE_target := i.DTE_target]
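One thing worth noting about the update join: := writes a single DTE_target per row, so where two of the +/- 10 windows overlap (with the targets above, 7 and 14 both cover DTE = 10), only the last matching target survives. If every match should produce a row, as in the rbindlist output, a minimal sketch of the same non-equi join in its regular (non-updating) form:

# one output row per (DT row, target) match; a row whose DTE falls
# in two overlapping windows appears twice, once per target
DT[DTE_table,
   on = .(DTE >= DTE_low, DTE <= DTE_high),
   .(date, exdate, price_at_entry, strike_price,
     DTE = x.DTE, DTE_target = i.DTE_target),
   nomatch = NULL]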
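The same pattern should also absorb the follow-up step on price_at_entry and strike_price mentioned above. That criterion isn't spelled out here, so this is only a sketch under the assumption that it is another range check, hypothetically on the strike-to-entry-price ratio (mny, the targets, and the 0.05 tolerance are made up for illustration):

# hypothetical second criterion: strike/price ratio near target moneyness
targets = CJ(DTE_target = c(7, 30, 60, 90),
             mny_target = c(0.9, 1.0, 1.1))
targets[, `:=`(DTE_low = DTE_target - 10, DTE_high = DTE_target + 10,
               mny_low = mny_target - 0.05, mny_high = mny_target + 0.05)]

DT[, mny := strike_price / price_at_entry]

# one non-equi join covers both range conditions at once, no lapply loop
DT[targets,
   on = .(DTE >= DTE_low, DTE <= DTE_high, mny >= mny_low, mny <= mny_high),
   .(date, exdate, price_at_entry, strike_price,
     DTE = x.DTE, DTE_target = i.DTE_target, mny_target = i.mny_target),
   nomatch = NULL]

Either way everything stays inside one vectorized join, which should scale far better than looping over targets.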