Here a MWE of my problem.
Data:
library(data.table)
#dates in %Y-%m-%d
df <- data.table(date=as.Date(c("2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02")), dtm=c(18L, 18L, 18L, 18L, 18L, 18L,46L,46L,74L, 74L,74L,74L,165L, 165L,165L,165L), cval=c(1275L, 1300L, 1300L, 1320L, 1325L, 1325L, 1300L, 1300L, 1300L, 1300L, 1325L, 1325L, 1300L, 1300L, 1325L, 1325L), price_in=c(24.125, 24.625, 35.750, 16.250, 14.500, 50.250, 43.625, 49.125, 58.250, 58.250, 45.375, 70.125, 90.750, 74.750, 77.875, 85.500), price_out=c(26.125, 26.625, 36.625, 17.500, 15.500, 52.250, 45.625, 51.125, 60.000, 60.250, 47.375, 72.125, 92.750, 76.750, 79.875, 87.500), type=c("P", "C", "P", "C", "C", "P", "C", "P", "C", "P", "C", "P", "C", "P", "C", "P"))
df
date dtm cval price_in price_out type
1: 2001-01-02 18 1275 24.125 26.125 P
2: 2001-01-02 18 1300 24.625 26.625 C
3: 2001-01-02 18 1300 35.750 36.625 P
4: 2001-01-02 18 1320 16.250 17.500 C
5: 2001-01-02 18 1325 14.500 15.500 C
6: 2001-01-02 18 1325 50.250 52.250 P
7: 2001-01-02 46 1300 43.625 45.625 C
8: 2001-01-02 46 1300 49.125 51.125 P
9: 2001-01-02 74 1300 58.250 60.000 C
10: 2001-01-02 74 1300 58.250 60.250 P
11: 2001-01-02 74 1325 45.375 47.375 C
12: 2001-01-02 74 1325 70.125 72.125 P
13: 2001-01-02 165 1300 90.750 92.750 C
14: 2001-01-02 165 1300 74.750 76.750 P
15: 2001-01-02 165 1325 77.875 79.875 C
16: 2001-01-02 165 1325 85.500 87.500 P
What I want to do:
- For each date, I want to obtain all items, divided in type
P
's andC
's that have the samedtm
but a largercval
. For the second item in the example data set, this would be:
date dtm cval price_in price_out type
2001-01-02 18 1300 24.625 26.625 C #the item
2001-01-02 18 1320 16.250 17.500 C #same dtm, higher cval
2001-01-02 18 1325 14.500 15.500 C #same dtm, higher cval
- Now, let
cval1
be thecval
of the current item, i.e. herecval1 = 1300
andcval2
the largercval
's of the items in this subset, i.e. herecval2 = c(1320L, 1325L)
. Then, I want to apply a custom exclusion function, for example let's sayprice_in[cval %in% cval2]-price_out[cval==cval1]-0.5*(cval1-cval2) < 0
- I then want to exclude all item pairs whose exclusion function returned
TRUE
.
Similarly (same procedure, different exclusion criteria) applies to the P
items.
Expected Output: The original data.table, df
, minus the rows that were excluded in the procedure just described above. For example, using the example function above evaluating items 2 and 4 returns TRUE: 16.25-26.625-0.5*(1300-1320) = -0.375 < 0
. Thus, the expected output would be df
without rows 2 and 4 (note that the pair 2 and 5 does not return TRUE: 14.5-26.625-0.5*(1300-1325) = 0.375 >= 0
, hence 5 is not excluded):
date dtm cval price_in price_out type
1: 2001-01-02 18 1275 24.125 26.125 P
3: 2001-01-02 18 1300 35.750 36.625 P
5: 2001-01-02 18 1325 14.500 15.500 C
6: 2001-01-02 18 1325 50.250 52.250 P
7: 2001-01-02 46 1300 43.625 45.625 C
8: 2001-01-02 46 1300 49.125 51.125 P
... ... ... ...
and so on. Obviously, as in the case of items 7 and 8, if there is no other item with the same characteristics (same date, dtm and type), it cannot be excluded.
What I have tried so far:
- I have created an id for each item, i.e.
df[,id:=seq_along(date)]
, then iterated viafor loop
through the dates and used vectors to check my custom functions. If the result vector containedTRUE
's, I removed the corresponding indices from my data.table. Clearly, this approach works but runs almost forever given the size of my data. - I am currently experimenting with data.table methods, because of their extreme speed advantages. If I understand it correctly, what I want are many rolling self-joins for each
date/dtm
subset, something along the lines"df[df,roll=Inf,by=.(date,dtm)]"
(since a total, rolling self-join, I believe, is not applicable in this case). But I don't quite get it to work.
Question: Is there a way of implementing this exclusion procedure via data.table methods? Possibly (but not necessarily) via multiple rolling self-joins?
Any help would be highly appreciated!