2

after a lot fo thinking and googling I could not find the solution to my problem, I hope you can help me.

I have a large data frame with an ID column that can repeat more than 2 times, a start and and end date column that would make up a time period. I would like to find out, grouping by ID, if any of the time periods for that ID overlap with another one, and if so, flag it by creating a new column for example, saying if that ID has overlaps or not.

Here is an example data frame already with the desired new column:

structure(list(ID= c(34L, 34L, 80L, 80L, 81L, 81L, 81L, 94L, 
94L), Start = structure(c(1072911600, 1262300400, 1157061600, 
1277935200, 1157061600, 1277935200, 1157061600, 1075590000, 1285891200
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1262214000, 
1409436000, 1251669600, 1404079200, 1251669600, 1404079200, 1251669600, 
1264892400, 1475193600), class = c("POSIXct", "POSIXt"), tzone = ""), 
    Overlap = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, 
    FALSE, FALSE)), .Names = c("ID", "Start", "End", "Overlap"
), row.names = c(NA, -9L), class = "data.frame")


 ID               Start                 End Overlap
 34 2004-01-01 00:00:00 2009-12-31 00:00:00   FALSE
 34 2010-01-01 00:00:00 2014-08-31 00:00:00   FALSE
 80 2006-09-01 00:00:00 2009-08-31 00:00:00   FALSE
 80 2010-07-01 00:00:00 2014-06-30 00:00:00   FALSE
 81 2006-09-01 00:00:00 2009-08-31 00:00:00    TRUE
 81 2010-07-01 00:00:00 2014-06-30 00:00:00    TRUE
 81 2006-09-01 00:00:00 2009-08-31 00:00:00    TRUE
 94 2004-02-01 00:00:00 2010-01-31 00:00:00   FALSE
 94 2010-10-01 02:00:00 2016-09-30 02:00:00   FALSE

In this case, for ID "81" there is an overlap between two time periods, so I would like to flag all rows with ID = 81 as TRUE, meaning that an overlap in at least two rows of that ID was found. This is just a desired solution, but in general, all I want to do is find out the overlaps when grouping by ID, so the way of flagging it can be flexible, in case it simplifies things.

Thanks in advance for any help.

ANieder
  • 223
  • 5
  • 15
  • What have you already tried? Where did you get stuck? – Eric Fail Jan 22 '16 at 16:25
  • Check `foverlaps` from data.table package. [Here's a post to get you started](http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window/25655497#25655497). – Arun Jan 22 '16 at 17:43

3 Answers3

5

Another option - assuming df contains your data frame, then:

library(data.table)
dt <- data.table(df, key=c("Start", "End"))[, `:=`(Overlap=NULL, row=1:nrow(df))]
overlapping <- unique(foverlaps(dt, dt)[ID==i.ID & row!=i.row, ID])
dt[, `:=`(Overlap=FALSE, row=NULL)][ID %in% overlapping, Overlap:=TRUE][order(ID, Start)]
#    ID               Start                 End Overlap
# 1: 34 2004-01-01 00:00:00 2009-12-31 00:00:00   FALSE
# 2: 34 2010-01-01 00:00:00 2014-08-31 00:00:00   FALSE
# 3: 80 2006-09-01 00:00:00 2009-08-31 00:00:00   FALSE
# 4: 80 2010-07-01 00:00:00 2014-06-30 00:00:00   FALSE
# 5: 81 2006-09-01 00:00:00 2009-08-31 00:00:00    TRUE
# 6: 81 2006-09-01 00:00:00 2009-08-31 00:00:00    TRUE
# 7: 81 2010-07-01 00:00:00 2014-06-30 00:00:00    TRUE
# 8: 94 2004-02-01 00:00:00 2010-01-31 00:00:00   FALSE
# 9: 94 2010-10-01 02:00:00 2016-09-30 02:00:00   FALSE
lukeA
  • 53,097
  • 5
  • 97
  • 100
  • Do you have any similar solution using the `dplyr` equivalent package? – David Leal Mar 01 '17 at 17:55
  • 1
    No sorry, afaik there's no `foverlaps` pendant in dplyr. – lukeA Mar 01 '17 at 18:41
  • I am trying to use your solution for a very similar case, just that I don't need to mark all the rows for a given `ID` instead only the rows with overlaps. For this particular example Overlap would be `TRUE` only for rows: 5,6 but not 7. **Is there any quick modification to your answer for getting this?** In this [post](http://stackoverflow.com/questions/325933/determine-whether-two-date-ranges-overlap) it is a good discussion about date overlapping, condition they propose this formulation: `max(startA, startB) <= min(endA, endB)`. I don't know if it would better than `foverlaps`. Any thoughts – David Leal Mar 01 '17 at 23:35
  • @DavidLeal did you manage to do this as per your comment? I am trying to overcome this problem – Salmo salar Feb 13 '18 at 22:22
  • @SalmoSalar, not really. Finally, I found a way to use it in the way it was proposed by @LujeA. A `TRUE` means this row has at least an overlap with any other row that has the same `ID`, it represents both direction relationship. I wanted just to have a one direction relationship, but this is mathematically more difficult to represent, and manipulating the dataset it is possible to overcome. – David Leal Feb 15 '18 at 19:03
0

I think this is the code that you are looking for? Let me know.

data<- structure(list(ID= c(34L, 34L, 80L, 80L, 81L, 81L, 81L, 94L, 
                            94L), Start = structure(c(1072911600, 1262300400, 1157061600, 
                                                      1277935200, 1157061600, 1277935200, 1157061600, 1075590000, 1285891200
                            ), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1262214000, 
                                                                                              1409436000, 1251669600, 1404079200, 1251669600, 1404079200, 1251669600, 
                                                                                              1264892400, 1475193600), class = c("POSIXct", "POSIXt"), tzone = ""), 
                      Overlap = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, 
                                  FALSE, FALSE)), .Names = c("ID", "Start", "End", "Overlap"
                                  ), row.names = c(NA, -9L), class = "data.frame")

library("dplyr")
library("lubridate")

overlaps<- function(intervals){
        for(i in 1:(length(intervals)-1)){
                for(j in (i+1):length(intervals)){
                        if(int_overlaps(intervals[i],intervals[j])){
                                return(TRUE)
                        }
                }
        }
        return(FALSE)
}

data %>%
        mutate(Interval=interval(Start,End))%>%
        group_by(ID) %>% 
       do({
               df<-.
               ovl<- overlaps(df$Interval)
               return(data.frame(ID=df$ID[1], ovl))
       })

Also, I hope that someone comes up with a more elegant solution to my overlaps function..

AlexT
  • 144
  • 8
  • Thanks, this worked perfectly. For some reason lukeA's solution didn't work, even though it looked like a simpler solution – ANieder Feb 04 '16 at 15:53
0

The ivs package is perfect for working with interval data like this. iv_groups() will merge overlapping intervals, so you can use that to determine if any groups had overlapping intervals.

library(ivs)
library(dplyr)

df <- tribble(
  ~ID, ~Start, ~End, ~ExpectedOverlap,
  34, "2004-01-01 00:00:00", "2009-12-31 00:00:00",   FALSE,
  34, "2010-01-01 00:00:00", "2014-08-31 00:00:00",   FALSE,
  80, "2006-09-01 00:00:00", "2009-08-31 00:00:00",   FALSE,
  80, "2010-07-01 00:00:00", "2014-06-30 00:00:00",   FALSE,
  81, "2006-09-01 00:00:00", "2009-08-31 00:00:00",    TRUE,
  81, "2010-07-01 00:00:00", "2014-06-30 00:00:00",    TRUE,
  81, "2006-09-01 00:00:00", "2009-08-31 00:00:00",    TRUE,
  94, "2004-02-01 00:00:00", "2010-01-31 00:00:00",   FALSE,
  94, "2010-10-01 02:00:00", "2016-09-30 02:00:00",   FALSE
)
df <- mutate(df, Start = as.POSIXct(Start, tz = "UTC"))
df <- mutate(df, End = as.POSIXct(End, tz = "UTC"))

df <- df %>%
  mutate(Range = iv(Start, End), .keep = "unused", .after = ID) %>%
  group_by(ID)

df
#> # A tibble: 9 × 3
#> # Groups:   ID [4]
#>      ID                                      Range ExpectedOverlap
#>   <dbl>                                 <iv<dttm>> <lgl>          
#> 1    34 [2004-01-01 00:00:00, 2009-12-31 00:00:00) FALSE          
#> 2    34 [2010-01-01 00:00:00, 2014-08-31 00:00:00) FALSE          
#> 3    80 [2006-09-01 00:00:00, 2009-08-31 00:00:00) FALSE          
#> 4    80 [2010-07-01 00:00:00, 2014-06-30 00:00:00) FALSE          
#> 5    81 [2006-09-01 00:00:00, 2009-08-31 00:00:00) TRUE           
#> 6    81 [2010-07-01 00:00:00, 2014-06-30 00:00:00) TRUE           
#> 7    81 [2006-09-01 00:00:00, 2009-08-31 00:00:00) TRUE           
#> 8    94 [2004-02-01 00:00:00, 2010-01-31 00:00:00) FALSE          
#> 9    94 [2010-10-01 02:00:00, 2016-09-30 02:00:00) FALSE

# `iv_groups()` merges all overlaps and returns the merged intervals
df %>%
  summarise(Range = iv_groups(Range), .groups = "drop")
#> # A tibble: 8 × 2
#>      ID                                      Range
#>   <dbl>                                 <iv<dttm>>
#> 1    34 [2004-01-01 00:00:00, 2009-12-31 00:00:00)
#> 2    34 [2010-01-01 00:00:00, 2014-08-31 00:00:00)
#> 3    80 [2006-09-01 00:00:00, 2009-08-31 00:00:00)
#> 4    80 [2010-07-01 00:00:00, 2014-06-30 00:00:00)
#> 5    81 [2006-09-01 00:00:00, 2009-08-31 00:00:00)
#> 6    81 [2010-07-01 00:00:00, 2014-06-30 00:00:00)
#> 7    94 [2004-02-01 00:00:00, 2010-01-31 00:00:00)
#> 8    94 [2010-10-01 02:00:00, 2016-09-30 02:00:00)

# So if the length of what `iv_groups()` returns is different
# from the original group size, then we have merged at least 1
# interval, meaning something overlaps
df %>%
  mutate(Overlap = length(iv_groups(Range)) != n())
#> # A tibble: 9 × 4
#> # Groups:   ID [4]
#>      ID                                      Range ExpectedOverlap Overlap
#>   <dbl>                                 <iv<dttm>> <lgl>           <lgl>  
#> 1    34 [2004-01-01 00:00:00, 2009-12-31 00:00:00) FALSE           FALSE  
#> 2    34 [2010-01-01 00:00:00, 2014-08-31 00:00:00) FALSE           FALSE  
#> 3    80 [2006-09-01 00:00:00, 2009-08-31 00:00:00) FALSE           FALSE  
#> 4    80 [2010-07-01 00:00:00, 2014-06-30 00:00:00) FALSE           FALSE  
#> 5    81 [2006-09-01 00:00:00, 2009-08-31 00:00:00) TRUE            TRUE   
#> 6    81 [2010-07-01 00:00:00, 2014-06-30 00:00:00) TRUE            TRUE   
#> 7    81 [2006-09-01 00:00:00, 2009-08-31 00:00:00) TRUE            TRUE   
#> 8    94 [2004-02-01 00:00:00, 2010-01-31 00:00:00) FALSE           FALSE  
#> 9    94 [2010-10-01 02:00:00, 2016-09-30 02:00:00) FALSE           FALSE
Davis Vaughan
  • 2,780
  • 9
  • 19