1

I want to filter a data frame named final_arrange that has ~80000 rows filled with events. Each event is assigned to an entity, which is identified with a character stored in the column ticker. Each event also has an event range/window that is specified by the columns Event_start and Event_end. The result I want to get is that I have non-overlapping events only. Meaning that for each entity(ticker) there is just one event at a time without event windows overlapping.

(hopefully)reproducible example:

structure(list(ticker = c("AAP", "AAP", "AAP", "AAP", "AAP", 
"AAP", "AAP", "AAP", "AAP", "AAP", "AAP", "AAPL", "AAPL", "AAPL", 
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", 
"AAPL", "AAPL", "EFSC", "EFSC", "EFX", "EFX", "EFX"), Event_start = structure(c(17858, 
17941, 17950, 17962, 18033, 18130, 18207, 18217, 18305, 18396, 
18418, 17716, 17717, 17718, 17719, 17720, 17723, 17724, 17725, 
17726, 17727, 17731, 17732, 17744, 18367, 18458, 17718, 17732, 
17746), class = "Date"), Event_end = structure(c(17868, 17951, 
17960, 17972, 18043, 18140, 18217, 18227, 18315, 18406, 18428, 
17726, 17727, 17728, 17729, 17730, 17733, 17734, 17735, 17736, 
17737, 17741, 17742, 17754, 18377, 18468, 17728, 17742, 17756
), class = "Date")), row.names = c(NA, -29L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x000002036a211ef0>, sorted = "Event_start")

I found a great thread with a data.table solution for a similar problem but struggle to tailor the code to my needs. Efficient way to drop rows with overlapping times The difference to my problem is that as a secondary condition I need to check if the current rows ticker has an exact match in a prior or previous one. If they do not have an exact match, a row should not be deleted, even if event windows overlap.

This is the code I got:

result <- setDT(final_arrange, key="Event_start")[!(Event_end >= shift(Event_start, type="lead", fill = T) & ticker == shift(ticker, type="lead", fill = T)) 
                                                       & !(Event_start <= shift(Event_end, type="lag", fill = F) & ticker == shift(ticker, type="lag", fill = T))]

The result it yields is:

    ticker Event_start  Event_end
    AAP  2018-11-23 2018-12-03
    AAP  2019-03-07 2019-03-17
    AAP  2019-05-17 2019-05-27
    AAP  2019-08-22 2019-09-01
    AAP  2020-02-13 2020-02-23
    AAP  2020-05-14 2020-05-24
    AAP  2020-06-05 2020-06-15
   AAPL  2018-08-01 2018-08-11
   EFSC  2020-04-15 2020-04-25
   EFSC  2020-07-15 2020-07-25
    EFX  2018-07-06 2018-07-16
    EFX  2018-07-20 2018-07-30
    EFX  2018-08-03 2018-08-13

What an desired output looks like:

    ticker Event_start  Event_end
    AAP  2018-11-23 2018-12-03
    AAP  2019-02-14 2019-02-24
    AAP  2019-03-07 2019-03-17
    AAP  2019-05-17 2019-05-27
    AAP  2019-08-22 2019-09-01
    AAP  2019-11-07 2019-11-17
    AAP  2020-02-13 2020-02-23
    AAP  2020-05-14 2020-05-24
    AAP  2020-06-05 2020-06-15
   AAPL  2018-07-04 2018-07-14
   AAPL  2018-07-15 2018-07-25
   AAPL  2018-08-01 2018-08-11
   EFSC  2020-04-15 2020-04-25
   EFSC  2020-07-15 2020-07-25
    EFX  2018-07-06 2018-07-16
    EFX  2018-07-20 2018-07-30
    EFX  2018-08-03 2018-08-13

There are some rows missing and I struggle to identify what the root of the problem is. Is it because using "==" to check if a row matches the prior or previous columns ticker is not resulting in exact matches only? I did not manage to use grepl (e.g. grepl(ticker, shift(ticker, type="lead", fill=T)) either (error message: "argument 'pattern' has length > 1 and only the first element will be used"). Because it refers to ticker as the column and not the ticker value in the unique row?

I also tried making a cartesian product and filtering afterwards but got stopped by hardware restrictions, as the data is too big and R tells me that it can not allocate the 24Gb dataframe. Data.table solutions are therefore preferred.

Did I do some rookie mistakes? How would you approach the problem? I hope the example I provided is reproducible as intended and I am happy for feedback of all sorts.

1 Answers1

1

You can use -

library(data.table)

setDT(final_arrange)[, .SD[Event_start - shift(Event_end) > 0 | seq_len(.N) == 1], ticker]

#    ticker Event_start  Event_end
# 1:    AAP  2018-11-23 2018-12-03
# 2:    AAP  2019-02-14 2019-02-24
# 3:    AAP  2019-03-07 2019-03-17
# 4:    AAP  2019-05-17 2019-05-27
# 5:    AAP  2019-08-22 2019-09-01
# 6:    AAP  2019-11-07 2019-11-17
# 7:    AAP  2020-02-13 2020-02-23
# 8:    AAP  2020-05-14 2020-05-24
# 9:    AAP  2020-06-05 2020-06-15
#10:   AAPL  2018-07-04 2018-07-14
#11:   AAPL  2018-08-01 2018-08-11
#12:   EFSC  2020-04-15 2020-04-25
#13:   EFSC  2020-07-15 2020-07-25
#14:    EFX  2018-07-06 2018-07-16
#15:    EFX  2018-07-20 2018-07-30
#16:    EFX  2018-08-03 2018-08-13

Or with dplyr -

library(dplyr)

final_arrange %>% 
  arrange(ticker, Event_start) %>%
  group_by(ticker) %>%
  filter(Event_start - lag(Event_end) > 0 | row_number() == 1)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks Ronak for the simple and straght forward solution! The result is looking good, but strangely one row is missing in your result: AAPL 2018-07-15 2018-07-25. Yet Event_start-lag(Event_end)>0 is true, so it should not have been omitted. – TribokThorsten Oct 12 '21 at 10:43