I want to filter a data frame named final_arrange that has ~80000 rows filled with events. Each event is assigned to an entity, which is identified with a character stored in the column ticker. Each event also has an event range/window that is specified by the columns Event_start and Event_end. The result I want to get is that I have non-overlapping events only. Meaning that for each entity(ticker) there is just one event at a time without event windows overlapping.
(hopefully)reproducible example:
structure(list(ticker = c("AAP", "AAP", "AAP", "AAP", "AAP",
"AAP", "AAP", "AAP", "AAP", "AAP", "AAP", "AAPL", "AAPL", "AAPL",
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL",
"AAPL", "AAPL", "EFSC", "EFSC", "EFX", "EFX", "EFX"), Event_start = structure(c(17858,
17941, 17950, 17962, 18033, 18130, 18207, 18217, 18305, 18396,
18418, 17716, 17717, 17718, 17719, 17720, 17723, 17724, 17725,
17726, 17727, 17731, 17732, 17744, 18367, 18458, 17718, 17732,
17746), class = "Date"), Event_end = structure(c(17868, 17951,
17960, 17972, 18043, 18140, 18217, 18227, 18315, 18406, 18428,
17726, 17727, 17728, 17729, 17730, 17733, 17734, 17735, 17736,
17737, 17741, 17742, 17754, 18377, 18468, 17728, 17742, 17756
), class = "Date")), row.names = c(NA, -29L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x000002036a211ef0>, sorted = "Event_start")
I found a great thread with a data.table solution for a similar problem but struggle to tailor the code to my needs. Efficient way to drop rows with overlapping times The difference to my problem is that as a secondary condition I need to check if the current rows ticker has an exact match in a prior or previous one. If they do not have an exact match, a row should not be deleted, even if event windows overlap.
This is the code I got:
result <- setDT(final_arrange, key="Event_start")[!(Event_end >= shift(Event_start, type="lead", fill = T) & ticker == shift(ticker, type="lead", fill = T))
& !(Event_start <= shift(Event_end, type="lag", fill = F) & ticker == shift(ticker, type="lag", fill = T))]
The result it yields is:
ticker Event_start Event_end
AAP 2018-11-23 2018-12-03
AAP 2019-03-07 2019-03-17
AAP 2019-05-17 2019-05-27
AAP 2019-08-22 2019-09-01
AAP 2020-02-13 2020-02-23
AAP 2020-05-14 2020-05-24
AAP 2020-06-05 2020-06-15
AAPL 2018-08-01 2018-08-11
EFSC 2020-04-15 2020-04-25
EFSC 2020-07-15 2020-07-25
EFX 2018-07-06 2018-07-16
EFX 2018-07-20 2018-07-30
EFX 2018-08-03 2018-08-13
What an desired output looks like:
ticker Event_start Event_end
AAP 2018-11-23 2018-12-03
AAP 2019-02-14 2019-02-24
AAP 2019-03-07 2019-03-17
AAP 2019-05-17 2019-05-27
AAP 2019-08-22 2019-09-01
AAP 2019-11-07 2019-11-17
AAP 2020-02-13 2020-02-23
AAP 2020-05-14 2020-05-24
AAP 2020-06-05 2020-06-15
AAPL 2018-07-04 2018-07-14
AAPL 2018-07-15 2018-07-25
AAPL 2018-08-01 2018-08-11
EFSC 2020-04-15 2020-04-25
EFSC 2020-07-15 2020-07-25
EFX 2018-07-06 2018-07-16
EFX 2018-07-20 2018-07-30
EFX 2018-08-03 2018-08-13
There are some rows missing and I struggle to identify what the root of the problem is. Is it because using "==" to check if a row matches the prior or previous columns ticker is not resulting in exact matches only? I did not manage to use grepl (e.g. grepl(ticker, shift(ticker, type="lead", fill=T)) either (error message: "argument 'pattern' has length > 1 and only the first element will be used"). Because it refers to ticker as the column and not the ticker value in the unique row?
I also tried making a cartesian product and filtering afterwards but got stopped by hardware restrictions, as the data is too big and R tells me that it can not allocate the 24Gb dataframe. Data.table solutions are therefore preferred.
Did I do some rookie mistakes? How would you approach the problem? I hope the example I provided is reproducible as intended and I am happy for feedback of all sorts.