Separating non-overlapping intervals within groups and counting in R

Question

Using R, I have inpatient data that I have grouped by DNA strain (of the pathogen), clinic of inpatient stay, and overlapping timeframe of the stay to determine if transmission is possible.

I need to sequentially number the overlapping groups. This would appear quite simple, but two issues:

Everything I have found on SO or elsewhere talks about numbering rows within groups. I need each row in a group the same number and the groups themselves to be counted.
Whatever approach would accomplish that initially seemed simple enough with a %>% group_by(strain, clinic) %>%, but this doesn't account for non-overlapping time intervals

I have tried several approaches and search before finally giving in and posting here (none of my attempts are worthy of event posting here to waste your time.) The below code is an example of the data I have (have) and data I want (want). Note for strain B, all patients are in Clinic_1 but there are two groups due to a separation in time intervals.

Any advice would be much appreciated.

have <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
                   strain=c(rep("A",4),rep("B",5)),
                   clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
                   datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/21","2020/03/06","2020/03/18")),
                   dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/22","2020/03/21","2020/03/22"))
                   )

want <- data.frame(have,overlap_number=c(1,1,2,2,3,3,3,4,4))

#How the final data would look
> View(want)
   id strain   clinic     datein    dateout overlap_number
1 K01      A Clinic_1 2020-01-01 2020-01-05              1
2 K02      A Clinic_1 2020-01-03 2020-01-16              1
3 K03      A Clinic_2 2020-02-03 2020-02-09              2
4 K04      A Clinic_2 2020-02-09 2020-02-19              2
5 K05      B Clinic_1 2020-02-18 2020-02-27              3
6 K06      B Clinic_1 2020-02-20 2020-02-23              3
7 K07      B Clinic_1 2020-02-21 2020-02-22              3
8 K08      B Clinic_1 2020-03-06 2020-03-21              4
9 K09      B Clinic_1 2020-03-18 2020-03-22              4

An alternative dataset based on Akrun's comment, changing dates slightly for K07:

have2 <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
                   strain=c(rep("A",4),rep("B",5)),
                   clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
                   datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/25","2020/03/06","2020/03/18")),
                   dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/29","2020/03/21","2020/03/22"))
                   )

#Output:

#> have2 %>% 
#+   mutate(overlap_number = rleid(strain, clinic, 
#+                                 cumsum(datein > lag(dateout, default = #first(dateout)))))
#   id strain   clinic     datein    dateout overlap_number
#1 K01      A Clinic_1 2020-01-01 2020-01-05              1
#2 K02      A Clinic_1 2020-01-03 2020-01-16              1
#3 K03      A Clinic_2 2020-02-03 2020-02-09              2
#4 K04      A Clinic_2 2020-02-09 2020-02-19              2
#5 K05      B Clinic_1 2020-02-18 2020-02-27              3
#6 K06      B Clinic_1 2020-02-20 2020-02-23              3
#7 K07      B Clinic_1 2020-02-25 2020-02-29              4 ## treats this as single, should be 3
#8 K08      B Clinic_1 2020-03-06 2020-03-21              5 ## should be 4
#9 K09      B Clinic_1 2020-03-18 2020-03-22              5 ## should be 4

score 1 · Answer 1 · answered Apr 16 '20 at 18:34

1

We can use rleid

library(dplyr)
library(data.table)
have %>% 
  mutate(overlap_number = rleid(strain, clinic, 
       cumsum(datein > lag(dateout, default = first(dateout)))))
#    id strain   clinic     datein    dateout overlap_number
#1 K01      A Clinic_1 2020-01-01 2020-01-05              1
#2 K02      A Clinic_1 2020-01-03 2020-01-16              1
#3 K03      A Clinic_2 2020-02-03 2020-02-09              2
#4 K04      A Clinic_2 2020-02-09 2020-02-19              2
#5 K05      B Clinic_1 2020-02-18 2020-02-27              3
#6 K06      B Clinic_1 2020-02-20 2020-02-23              3
#7 K07      B Clinic_1 2020-02-21 2020-02-22              3
#8 K08      B Clinic_1 2020-03-06 2020-03-21              4
#9 K09      B Clinic_1 2020-03-18 2020-03-22              4

answered Apr 16 '20 at 18:34

akrun

874,273
37
540
662

Beautiful - was unaware of the rleid function and works perfectly on example data. One slight hiccup when I apply it to my actual data - groups with 3+ sometimes have a long interval that encompasses several others disjointed dates. For an example of this, I edited in another dataset which changed `K07`'s dates. I tried changing `...>lag(dateout...)` to `...>min(dateout...)` but didnt work. Any other ideas? – jpsmith Apr 16 '20 at 19:33
@jpsmith is the same overlap number you needed – akrun Apr 16 '20 at 19:42
I added my output and notes; when I run it on want2 it creates 5 overlap numbers as it treats `K07` as a single unit – jpsmith Apr 16 '20 at 19:46

chinsoon12 · Accepted Answer · 2020-04-19T23:20:41.940

An option using data.table:

setkey(setDT(have), clinic, strain, datein, dateout)
have[, g := cumsum(c(0L, (shift(datein, -1L) > cummax(as.integer(dateout)))[-.N])), 
  .(clinic, strain)][, 
    g := rleid(clinic, strain, g)]

Also:

have[, g02 := cumsum(datein > shift(cummax(as.integer(dateout)), fill=dateout[1L])),
    .(clinic, strain)][,
        g2 := rleid(clinic, strain, g02)]

output:

    id strain   clinic     datein    dateout g g2
1: K01      A Clinic_1 2020-01-01 2020-01-05 1  1
2: K02      A Clinic_1 2020-01-03 2020-01-16 1  1
3: K05      B Clinic_1 2020-02-18 2020-02-27 2  2
4: K06      B Clinic_1 2020-02-20 2020-02-23 2  2
5: K07      B Clinic_1 2020-02-21 2020-02-22 2  2
6: K08      B Clinic_1 2020-03-06 2020-03-21 3  3
7: K09      B Clinic_1 2020-03-18 2020-03-22 3  3
8: K03      A Clinic_2 2020-02-03 2020-02-09 4  4
9: K04      A Clinic_2 2020-02-09 2020-02-19 4  4

Idea on the cummax came from David Aurenburg post: How to flatten / merge overlapping time periods

thank you for great insight - in this solution, K07 is grouped with K08 and K09, but should be grouped with K05/K06. I tried several alternatives based on your approach but couldn't quite get it to work. Any ideas? Thank you for your help so far! — jpsmith, Apr 17 '20 at 14:10

Separating non-overlapping intervals within groups and counting in R

2 Answers2