4

Here a MWE of my problem.

Data:

library(data.table)
#dates in %Y-%m-%d    
    df <- data.table(date=as.Date(c("2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02")), dtm=c(18L,  18L, 18L,  18L,  18L, 18L,46L,46L,74L, 74L,74L,74L,165L, 165L,165L,165L), cval=c(1275L, 1300L, 1300L, 1320L, 1325L, 1325L, 1300L, 1300L, 1300L, 1300L, 1325L, 1325L, 1300L, 1300L, 1325L, 1325L), price_in=c(24.125, 24.625, 35.750, 16.250, 14.500, 50.250, 43.625, 49.125, 58.250, 58.250, 45.375, 70.125, 90.750, 74.750, 77.875, 85.500), price_out=c(26.125, 26.625, 36.625, 17.500, 15.500, 52.250, 45.625, 51.125, 60.000, 60.250, 47.375, 72.125, 92.750, 76.750, 79.875, 87.500), type=c("P", "C", "P", "C", "C", "P", "C", "P", "C", "P", "C", "P", "C", "P", "C", "P"))
    df
          date dtm cval price_in price_out type
 1: 2001-01-02  18 1275   24.125    26.125    P
 2: 2001-01-02  18 1300   24.625    26.625    C
 3: 2001-01-02  18 1300   35.750    36.625    P
 4: 2001-01-02  18 1320   16.250    17.500    C
 5: 2001-01-02  18 1325   14.500    15.500    C
 6: 2001-01-02  18 1325   50.250    52.250    P
 7: 2001-01-02  46 1300   43.625    45.625    C
 8: 2001-01-02  46 1300   49.125    51.125    P
 9: 2001-01-02  74 1300   58.250    60.000    C
10: 2001-01-02  74 1300   58.250    60.250    P
11: 2001-01-02  74 1325   45.375    47.375    C
12: 2001-01-02  74 1325   70.125    72.125    P
13: 2001-01-02 165 1300   90.750    92.750    C
14: 2001-01-02 165 1300   74.750    76.750    P
15: 2001-01-02 165 1325   77.875    79.875    C
16: 2001-01-02 165 1325   85.500    87.500    P

What I want to do:

  1. For each date, I want to obtain all items, divided in type P's and C's that have the same dtm but a larger cval. For the second item in the example data set, this would be:

date dtm cval price_in price_out type 2001-01-02 18 1300 24.625 26.625 C #the item 2001-01-02 18 1320 16.250 17.500 C #same dtm, higher cval 2001-01-02 18 1325 14.500 15.500 C #same dtm, higher cval

  1. Now, let cval1 be the cval of the current item, i.e. here cval1 = 1300 and cval2 the larger cval's of the items in this subset, i.e. here cval2 = c(1320L, 1325L). Then, I want to apply a custom exclusion function, for example let's say price_in[cval %in% cval2]-price_out[cval==cval1]-0.5*(cval1-cval2) < 0
  2. I then want to exclude all item pairs whose exclusion function returned TRUE.

Similarly (same procedure, different exclusion criteria) applies to the P items.

Expected Output: The original data.table, df, minus the rows that were excluded in the procedure just described above. For example, using the example function above evaluating items 2 and 4 returns TRUE: 16.25-26.625-0.5*(1300-1320) = -0.375 < 0. Thus, the expected output would be df without rows 2 and 4 (note that the pair 2 and 5 does not return TRUE: 14.5-26.625-0.5*(1300-1325) = 0.375 >= 0, hence 5 is not excluded):

          date dtm cval price_in price_out type
 1: 2001-01-02  18 1275   24.125    26.125    P
 3: 2001-01-02  18 1300   35.750    36.625    P
 5: 2001-01-02  18 1325   14.500    15.500    C
 6: 2001-01-02  18 1325   50.250    52.250    P
 7: 2001-01-02  46 1300   43.625    45.625    C
 8: 2001-01-02  46 1300   49.125    51.125    P
          ...      ...       ...       ...     

and so on. Obviously, as in the case of items 7 and 8, if there is no other item with the same characteristics (same date, dtm and type), it cannot be excluded.

What I have tried so far:

  1. I have created an id for each item, i.e. df[,id:=seq_along(date)], then iterated via for loop through the dates and used vectors to check my custom functions. If the result vector contained TRUE's, I removed the corresponding indices from my data.table. Clearly, this approach works but runs almost forever given the size of my data.
  2. I am currently experimenting with data.table methods, because of their extreme speed advantages. If I understand it correctly, what I want are many rolling self-joins for each date/dtm subset, something along the lines "df[df,roll=Inf,by=.(date,dtm)]" (since a total, rolling self-join, I believe, is not applicable in this case). But I don't quite get it to work.

Question: Is there a way of implementing this exclusion procedure via data.table methods? Possibly (but not necessarily) via multiple rolling self-joins?

Any help would be highly appreciated!

Daedalus
  • 235
  • 1
  • 2
  • 9
  • 2
    Really, you should explicitly show the full terminal desired output, per https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 – Frank Sep 10 '17 at 18:42
  • Thanks Frank. Given that I am using the exclusion criteria described, I thought it's clear that the output would be again the same data frame df just missing the rows corresponding to the items violating said criteria. I will see if I can make it even clearer somewhere. – Daedalus Sep 10 '17 at 18:46
  • 2
    Quoting from the link: "This does take some work, but seems like a fair trade off since you are asking others to do work for you." It is not about whether we *can* figure it out. As a matter of etiquette here, it's your responsibility, not ours. Anyway, you're free to demand answerers do that, just less likely to get an answer that doesn't require your clarifying comments later. – Frank Sep 10 '17 at 19:04
  • My apologies. I am still learning on how to post questions here, and I truly hope this question shows improvements over my last one. I have edited my expected output as an extra section so even if the reader just skims through my text he can see what I am after. Anything else that needs improvement? I do appreciate your help, Frank. – Daedalus Sep 10 '17 at 19:10
  • 1
    @Daedalus - Would you mind including a numeric equivalent of the expression "price_in[cval %in% cval2]-price_out[cval==cval1]-0.9*(cval1-cval2) >= 0"? I'm following most of your question, but I'm not seeing how you are using the above to return only row "4: 2001-01-02 18 1320 16.250 17.500 C" when dtm==18 and type==C. – www Sep 10 '17 at 23:13
  • @RyanRunge My bad, the equation was meant to be completely arbitrary, so I just assumed some items to violate a given criterion without matching them to the example equation. However, as you correctly pointed out, this might lead to confusion. Fixed it (amended equation to 0.5 factor and made it negative) and added a numerical calculation example. I hope my math checks out. – Daedalus Sep 11 '17 at 08:23
  • General note: I do not require this to be solved by rolling self-joins, that's just my latest idea/understanding/research on how to implement it with data.table. If there are any other ideas on how to (efficiently) deal with this that improve my for-loop based approach, I would very much appreciate those! – Daedalus Sep 11 '17 at 08:26

0 Answers0