I am trying to find overlapping periods for a data frame with two identifiers, which differs from the other questions which have been asked on this topic where overlapping observations had to be counted for only one identifier.
In my data frame, every transaction has a unique ID and every firm which is transacted in has a unique firm identifier. I am trying to find overlapping periods per firm identifier, counted per transaction ID.
I used the code from @Waldi to create this reproducible example, found in this topic (I only added a firm identifier or "FirmID" to the sample data frame):
Code to create a similar dataset:
library (data.table)
size = 1e5
df <- data.frame(
ID = sample(1:round(size / 5, 0)),
FirmID = sample(1:20000),
period = sample(c(5,10,30,45), size, replace = TRUE),
start = sample(seq(
as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
), size, replace = TRUE)
) %>% mutate(end = start + period)
The code to find overlapping periods:
dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]
setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
,.(noverlap=.N),by=.(FirmID,row)][
,.(overlap = max(noverlap>1)),by=FirmID][
,.(n=.N),by=.(overlap)][
,pct:=n/sum(n)][]
The only problem with this code, is that it displays the number of overlapping firms, instead of the number of overlapping transactions within a firm identifier.
overlap n pct
1: 0 5333 0.26665
2: 1 14667 0.73335
How can this code be altered for overlapping transaction IDs within a firm identifier? One change I made myself did not yield satisfactory results:
setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
,.(noverlap=.N),by=.(ID,row)][
,.(overlap = max(noverlap>1)),by=ID][
,.(n=.N),by=.(overlap)][
,pct:=n/sum(n)][]
overlap n pct
1: 0 5333 0.26665
2: 1 14667 0.73335