
Let's say I have the following data.table:

DT = structure(list(date = structure(c(17774, 16545, 15398, 17765, 
17736, 16342, 15896, 17928, 16692, 18022), class = "Date"), exdate = structure(c(17809, 
16549, 15605, 17781, 17746, 16361, 16060, 17977, 16724, 18033
), class = "Date"), price_at_entry = c(301.66, 205.27, 33.81, 
321.64, 297.43, 245.26, 122.27, 312.21, 253.19, 255.34), strike_price = c(195, 
212.5, 37, 255, 430, 120, 46, 320, 440, 245)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

DT[, `:=`(DTE = as.integer(difftime(exdate, date, unit = 'days')))]

          date     exdate price_at_entry strike_price DTE
 1: 2018-08-31 2018-10-05         301.66        195.0  35
 2: 2015-04-20 2015-04-24         205.27        212.5   4
 3: 2012-02-28 2012-09-22          33.81         37.0 207
 4: 2018-08-22 2018-09-07         321.64        255.0  16
 5: 2018-07-24 2018-08-03         297.43        430.0  10
 6: 2014-09-29 2014-10-18         245.26        120.0  19
 7: 2013-07-10 2013-12-21         122.27         46.0 164
 8: 2019-02-01 2019-03-22         312.21        320.0  49
 9: 2015-09-14 2015-10-16         253.19        440.0  32
10: 2019-05-06 2019-05-17         255.34        245.0  11

I want to subset the data.table to rows whose DTE is within 10 units of various DTE_target values. My current solution is to use rbindlist and lapply to loop through the values of DTE_target, something like this:

rbindlist(
  lapply(c(7, 30, 60, 90), function(DTE_target){
    DT[DT[, .I[abs(DTE - DTE_target) == min(abs(DTE - DTE_target)) &
                 abs(DTE - DTE_target) < 10], by = date]$V1
       ][, DTE_target := DTE_target]
  })
)

         date     exdate price_at_entry strike_price DTE DTE_target
1: 2015-04-20 2015-04-24         205.27        212.5   4          7
2: 2018-08-22 2018-09-07         321.64        255.0  16          7
3: 2018-07-24 2018-08-03         297.43        430.0  10          7
4: 2019-05-06 2019-05-17         255.34        245.0  11          7
5: 2018-08-31 2018-10-05         301.66        195.0  35         30
6: 2015-09-14 2015-10-16         253.19        440.0  32         30

Is there a more idiomatic, efficient data.table solution? I need to apply this process to potentially billions of rows, so I am also open to a PostgreSQL solution. After obtaining the above result, I repeat a similar process using price_at_entry and strike_price (which in its current form introduces even more looping).

Maybe it's possible to use rolling joins, e.g. by joining DT on itself with date and exdate as the keys and roll = 10, but I cannot seem to get a solution that makes sense.

Any help would be appreciated. Thanks!

EDIT:::::

I can't believe I missed this... Here is a potential solution that I need to keep exploring, but it seems to be very efficient.

DTE_target = c(7, 14, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300)
# create a map of target DTEs with the +/- 10 range
# (the DTE column is duplicated as DTE_target because the rolling join below
#  overwrites the join column DTE with DT's values, so the target would otherwise be lost)
DTE_table = data.table(DTE = DTE_target, DTE_low = DTE_target - 10,
                       DTE_high = DTE_target + 10,
                       DTE_target = DTE_target)
# map on nearest
DTE_table[DT, on = .(DTE), roll = "nearest"]
# subset on low/high range
DTE_table[DT, on = .(DTE), roll = "nearest"][DTE >= DTE_low & DTE <= DTE_high]

EDIT::::

Based on @Henrik's comment:

DT[DTE_table, on = .(DTE >= DTE_low, DTE <= DTE_high), DTE_target := i.DTE_target]
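
For completeness (my own addition, not part of the original edit): after this update join, rows that fall outside every target range are left with NA in DTE_target, so the matched subset can be recovered with a simple filter.

# keep only rows that landed inside one of the target ranges
DT[!is.na(DTE_target)]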
  • If I have understood you correctly, perhaps a non-equi join: `d2 = data.table(DTE_target = c(7,30,60,90))`; ``d2[ , `:=`(from = DTE_target - 10, to = DTE_target + 10)]``; `DT[d2, on = .(DTE >= from, DTE <= to), .(date, DTE = x.DTE, DTE_target), nomatch = NULL]`. See e.g. [Overlap join with start and end positions](https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions) or loads of other similar posts. – Henrik May 28 '21 at 19:19

1 Answer


For each DTE_target, flag the rows whose DTE is within 10 units of that target. This outputs a boolean array with one column per target.

DT[, DTE := as.integer(difftime(exdate, date, unit = 'days')) ]
DTE_target <- c(7,30, 60, 90)
val = 10
bool_arr <- DT[, lapply(DTE_target, function(x) abs(DTE - x) <= val) ]

Then go through the array row-wise and flag any row containing a TRUE, and use that to extract the rows from the original DT data.table.

selected_rows <- apply(bool_arr, 1, any)
DT[selected_rows, ]

Here is the full code and output:

library(data.table)
DTE_target <- c(7,30, 60, 90)
val = 10   # 10 units value
DT[apply(DT[, lapply(DTE_target, function(x) abs(DTE - x) <= val) ], 1, any), ]

# date     exdate price_at_entry strike_price DTE
#1: 2018-08-31 2018-10-05         301.66        195.0  35
#2: 2015-04-20 2015-04-24         205.27        212.5   4
#3: 2018-08-22 2018-09-07         321.64        255.0  16
#4: 2018-07-24 2018-08-03         297.43        430.0  10
#5: 2015-09-14 2015-10-16         253.19        440.0  32
#6: 2019-05-06 2019-05-17         255.34        245.0  11

Now use the filtered dataset to apply the same approach to the other columns, price_at_entry and strike_price.
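
A minimal sketch of what that second pass could look like, assuming hypothetical price targets and reusing the same tolerance val (the price_target values are placeholders, not from the question):

# hypothetical price targets -- placeholders for illustration only
price_target <- c(100, 200, 300)
filtered <- DT[apply(DT[, lapply(DTE_target, function(x) abs(DTE - x) <= val)], 1, any), ]
# same boolean-mask idea, now applied to price_at_entry on the filtered rows
filtered[apply(filtered[, lapply(price_target, function(x) abs(price_at_entry - x) <= val)], 1, any), ]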

Since you have billions of rows of data, you can split the data into chunks and apply the above function to each chunk to speed things up.
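
For example, a rough chunking sketch (the chunk size is an arbitrary assumption, not taken from the question):

chunk_size <- 1e7   # assumed chunk size; tune to available memory
chunks <- split(seq_len(nrow(DT)), ceiling(seq_len(nrow(DT)) / chunk_size))
res <- rbindlist(lapply(chunks, function(idx){
  sub <- DT[idx]
  sub[apply(sub[, lapply(DTE_target, function(x) abs(DTE - x) <= val)], 1, any), ]
}))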

Solution 2: handling target values that are not mutually exclusive, e.g. 30 and 31

DTE_target <- c(7,30, 31, 60, 90)

bool_arr <- DT[, lapply(DTE_target, function(x) abs(DTE - x) <= val) ]
target_vals <- apply(bool_arr, 1, any)
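# for each row, collect which target values matched (possibly more than one)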
dt_vals <- apply(bool_arr, 1, function(x) DTE_target[x])
rm(bool_arr)  # remove bool_arr from memory to free up space
DT[target_vals, ][, `:=`(DTE_t = dt_vals[target_vals])][]
rm(target_vals)
rm(dt_vals)    

#         date     exdate price_at_entry strike_price DTE DTE_t
#1: 2018-08-31 2018-10-05         301.66        195.0  35 30,31
#2: 2015-04-20 2015-04-24         205.27        212.5   4     7
#3: 2018-08-22 2018-09-07         321.64        255.0  16     7
#4: 2018-07-24 2018-08-03         297.43        430.0  10     7
#5: 2015-09-14 2015-10-16         253.19        440.0  32 30,31
#6: 2019-05-06 2019-05-17         255.34        245.0  11     7

Solution 3


Data:

library(data.table)
setDT(DT)
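# stack 10^6 copies of the 10-row example to build a large test table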
DT = rbindlist( lapply( 1:10^6, function(i){ DT } ) )
DTE_target <- c(7,30, 31, 60, 90)
val=10

Code

system.time({
  DT[, id := .I]
  DT[, DTE := as.integer(difftime(exdate, date, unit = 'days')) ]
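# per row (grouped by id), list every target within val of DTE and collapse to a string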
  DT[, DTE_t := paste(DTE_target[ abs(DTE - DTE_target)<=val], collapse = "," ), by = id]
  DT[, id := NULL]
})

#user  system elapsed 
#91.90    0.46   92.48 

Output:

head(DT, 10)
#          date     exdate price_at_entry strike_price DTE DTE_t
# 1: 2018-08-31 2018-10-05         301.66        195.0  35 30,31
# 2: 2015-04-20 2015-04-24         205.27        212.5   4     7
# 3: 2012-02-28 2012-09-22          33.81         37.0 207      
# 4: 2018-08-22 2018-09-07         321.64        255.0  16     7
# 5: 2018-07-24 2018-08-03         297.43        430.0  10     7
# 6: 2014-09-29 2014-10-18         245.26        120.0  19      
# 7: 2013-07-10 2013-12-21         122.27         46.0 164      
# 8: 2019-02-01 2019-03-22         312.21        320.0  49      
# 9: 2015-09-14 2015-10-16         253.19        440.0  32 30,31
# 10: 2019-05-06 2019-05-17         255.34        245.0  11     7
  • oh wow. Just tested this on 1 million rows and it's INSANELY better than my solution. Is there any way to keep the `DTE_target` as a column as well? I need to group by that variable in the future – road_to_quantdom May 28 '21 at 17:45
  • since you have many target values for DTE, you might end up with many values selecting the same row. In other words, some rows in the `DTE_target` column will have more than one value. Do you need this type of behavior? `DTE_target = c(30, 31)` will select rows with DTE column values: 32, 35. – Sathish May 28 '21 at 17:49
  • yeah that kind of overlap is fine. The `DTE_target` will be incremented in a way that does not conflict with `val = 10` – road_to_quantdom May 28 '21 at 17:53
  • sorry, after testing it on a larger dataset it's actually slower than my current solution. It works really well without `dt_vals` (more than 100x faster), but including the `dt_vals` line slows it down considerably. Unfortunately, I need the `dt_vals` column in the final result – road_to_quantdom May 28 '21 at 18:44
  • slower. Just use a bunch of copies of this small example to simulate the million `DT = rbindlist( lapply( 1:10^6, function(i){ DT } ) )` – road_to_quantdom May 28 '21 at 19:01
  • see my edit , i think using joins might be the way to go here – road_to_quantdom May 28 '21 at 19:16
  • See Solution 3. It is the best we can get with data.table, in my opinion. Doing the join operation as you showed in your question will give wrong results, because collisions occur between the target values, which leads to side effects like data being overwritten. Solution 3 does not have that side effect. Hope this helps. – Sathish May 28 '21 at 22:26
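
To illustrate the collision Sathish describes, here is a sketch of my own, using the original 10-row DT and overlapping targets 30 and 31: the non-equi update join from the question's edit can keep only one target per row, so one of the overlapping matches is lost.

# overlapping targets: 30 and 31 both cover DTE values 32 and 35
DTE_target <- c(7, 30, 31, 60, 90)
DTE_table  <- data.table(DTE_target = DTE_target,
                         DTE_low = DTE_target - 10,
                         DTE_high = DTE_target + 10)
DT[DTE_table, on = .(DTE >= DTE_low, DTE <= DTE_high), DTE_target := i.DTE_target]
DT[DTE == 32]   # only a single DTE_target survives, although both 30 and 31 match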