R Overlapping period for event with two identifiers

Question

I am trying to find overlapping periods for a data frame with two identifiers, which differs from the other questions which have been asked on this topic where overlapping observations had to be counted for only one identifier.

In my data frame, every transaction has a unique ID and every firm which is transacted in has a unique firm identifier. I am trying to find overlapping periods per firm identifier, counted per transaction ID.

I used the code from @Waldi to create this reproducible example, found in this topic (I only added a firm identifier or "FirmID" to the sample data frame):

How to find the proportion of IDs with overlapping interval dates in very large datasets in R

Code to create a similar dataset:

library (data.table)

size = 1e5

df <- data.frame(
  ID = sample(1:round(size / 5, 0)),
  FirmID = sample(1:20000),
  period = sample(c(5,10,30,45), size, replace = TRUE),
  start = sample(seq(
    as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
  ), size, replace = TRUE)
) %>% mutate(end = start + period)

The code to find overlapping periods:

dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]

setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
  ,.(noverlap=.N),by=.(FirmID,row)][
    ,.(overlap = max(noverlap>1)),by=FirmID][
      ,.(n=.N),by=.(overlap)][
        ,pct:=n/sum(n)][]

The only problem with this code, is that it displays the number of overlapping firms, instead of the number of overlapping transactions within a firm identifier.

overlap     n     pct
1:       0  5333 0.26665
2:       1 14667 0.73335

How can this code be altered for overlapping transaction IDs within a firm identifier? One change I made myself did not yield satisfactory results:

setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
  ,.(noverlap=.N),by=.(ID,row)][
    ,.(overlap = max(noverlap>1)),by=ID][
      ,.(n=.N),by=.(overlap)][
        ,pct:=n/sum(n)][]

   overlap     n     pct
1:       0  5333 0.26665
2:       1 14667 0.73335

Alexis · Accepted Answer · 2022-04-07T16:29:17.633

I had a hard time wrapping my head around the solution for the other scenario, so I'm not sure how to adjust that version, but I believe you can achieve the result you want in this case with an inner non-equi join:

dt <- data.table(df, key = c("FirmID", "ID", "start", "end"))
dt[, firm_total := .N, by = "FirmID"][
    dt, .(FirmID, firm_total), on = .(FirmID, ID < ID, start <= end, end >= start), nomatch = NULL, mult = "all"][
        , .(n = .N, pct = .N / (firm_total[1L] * (firm_total[1L] - 1L) / 2L)), by = "FirmID"]

We avoid redundant rows in the join with ID < ID (and note that here that should be interpreted as left-hand side column's ID < right-hand side column's ID). Redundancy can happen because, if ID x overlaps with ID y, y overlaps with x.

If you think of all pairs of IDs and put them in a matrix, the maximum number of overlaps would be the number of elements in the lower triangular, which can be calculated with n * (n - 1) / 2, that's why we initially add firm_total.

I didn't do extensive testing but this version may be better for this scenario. The documentation of foverlaps states that it's mainly targeted at joins where one table is much smaller than the other one, and explains why it may be an expensive operation. You're doing a self-join, so both tables are the same size.

And a table.express version of the solution because why not:

library(table.express)

dt %>%
    group_by(FirmID) %>%
    mutate(firm_total = .N) %>%
    inner_join(dt, FirmID, ID < ID, start <= end, end >= start, .expr = TRUE) %>%
    select(FirmID, firm_total) %>%
    group_by(FirmID) %>%
    summarize(n = .N, pct = .N / (firm_total[1L] * (firm_total[1L] - 1L) / 2))

EDIT: and if you want your overlap column you could compute that with size of lower triangular - n.

Thanks, the code is working perfectly! – M1ke Apr 07 '22 at 20:53 — M1ke, Apr 07 '22 at 20:53

R Overlapping period for event with two identifiers

1 Answers1