4

I need to remove rows with overlapped dates and keep the x value which is maximum among the overlapped dates. Here is a data frame:

data.frame(time_left = c("2011-08-05",
"2011-07-25",
"2017-08-20",
"2017-08-20", 
"2017-10-09", 
"2019-06-01"), 
time_right= c("2011-09-14",
"2011-09-01",
"2017-09-12",
"2017-09-26",
"2017-10-15",
"2019-11-05"),
x = c(114,20,10,1,5,100) ) -> df

so my input is:

   time_left time_right   x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01  20
3 2017-08-20 2017-09-12  10
4 2017-08-20 2017-09-26   1
5 2017-10-09 2017-10-15   5
6 2019-06-01 2019-11-05 100

and my desired output is:

  time_left   time_right   x
1 2011-08-05 2011-09-14  114
2 2011-07-25 2011-09-01   20
4 2017-08-20 2017-09-26   10  
5 2017-10-09 2017-10-15    5
6 2019-06-01 2019-11-05  100

I appreciate any help.

ADEN
  • 99
  • 4
  • How are you deciding the `time_left/right` variables in the output? What if the times only partially overlap instead of one being entirely within the other? – thelatemail Aug 31 '22 at 22:13
  • 3
    I do not want to remove partially overlapped records. Only fully overlapped records need to be removed – ADEN Aug 31 '22 at 22:23
  • You might want to make that clear within the question by adding a partially overlapping example that remains. – thelatemail Aug 31 '22 at 23:50
  • I edited the question and added a partially overlapped date which should not be removed. – ADEN Sep 01 '22 at 01:54

3 Answers3

2

@Maël brought this issue to my attention over on the ivs issue page https://github.com/DavisVaughan/ivs/issues/20.

I think this can be very elegantly and efficiently solved with ivs, but it is a bit hard to come up with the solution, so I'll probably add a helper to do this more easily.

This solution works with "recursive" containers too, i.e. where range A contains range B, but then range C also contains range A, so you really only want to list range C. I've described this in more detail with examples here https://github.com/DavisVaughan/ivs/issues/20#issuecomment-1234479783.

library(ivs)
library(dplyr)
library(vctrs)

df <- tibble(
  time_left = as.Date(c(
    "2011-08-05", "2011-07-25", "2017-08-20",
    "2017-08-20", "2017-10-09", "2019-06-01"
  )),
  time_right = as.Date(c(
    "2011-09-14", "2011-09-01", "2017-09-12",
    "2017-09-26", "2017-10-15", "2019-11-05"
  )),
  x = c(114, 20, 10, 1, 5, 100)
)

df <- df %>% 
  mutate(range = iv(time_left, time_right), .keep = "unused")

df
#> # A tibble: 6 × 2
#>       x                    range
#>   <dbl>               <iv<date>>
#> 1   114 [2011-08-05, 2011-09-14)
#> 2    20 [2011-07-25, 2011-09-01)
#> 3    10 [2017-08-20, 2017-09-12)
#> 4     1 [2017-08-20, 2017-09-26)
#> 5     5 [2017-10-09, 2017-10-15)
#> 6   100 [2019-06-01, 2019-11-05)

iv_locate_max_containment <- function(x) {
  # Find all locations where the range "contains" any other range
  # (including itself)
  locs <- iv_locate_overlaps(x, x, type = "contains")
  
  # Find the "top" ranges, i.e. the containers that aren't contained
  # by any other containers
  top <- !vec_duplicate_detect(locs$haystack)
  top <- vec_slice(locs$haystack, top)
  top <- vec_in(locs$needles, top)
  
  locs <- vec_slice(locs, top)
  
  locs
}

# i.e. row 4 "contains" rows 3 and 4
locs <- iv_locate_max_containment(df$range)
locs
#>   needles haystack
#> 1       1        1
#> 2       2        2
#> 3       4        3
#> 4       4        4
#> 5       5        5
#> 6       6        6

iv_align(df$range, df$x, locations = locs) %>%
  rename(range = needles) %>%
  group_by(range) %>%
  summarise(x = max(haystack))
#> # A tibble: 5 × 2
#>                      range     x
#>                 <iv<date>> <dbl>
#> 1 [2011-07-25, 2011-09-01)    20
#> 2 [2011-08-05, 2011-09-14)   114
#> 3 [2017-08-20, 2017-09-26)    10
#> 4 [2017-10-09, 2017-10-15)     5
#> 5 [2019-06-01, 2019-11-05)   100

Created on 2022-09-01 with reprex v2.0.2

Davis Vaughan
  • 2,780
  • 9
  • 19
1

This may sound a little verbose, however, this could also be a solution:

  • First we identify those observations that are potentially overlapped.
  • Then we group the similar ones.
  • In each group we choose the minimum time_left and maximum time_right and x.
library(tidyverse)

df %>%
  mutate(across(starts_with('time'), ymd), 
         intv = interval(time_left, time_right),
         id = row_number()) %>%
  mutate(id2 = map2(intv, id, ~ if (any(.x %within% intv[intv != .x])) {
    id[which(.x %within% intv[intv != .x]) + 1] 
  } else {
    .y
  })) %>%
  group_by(id2) %>%
  summarise(time_left = min(time_left), 
            across(c(time_right, x), max)) %>%
  select(!(id2))


# A tibble: 4 × 3
  time_left  time_right     x
  <date>     <date>     <dbl>
1 2011-08-05 2011-09-14   114
2 2017-08-20 2017-09-26    10
3 2017-10-09 2017-10-15     5
4 2019-06-01 2019-11-05   100
thelatemail
  • 91,185
  • 12
  • 128
  • 188
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • 1
    what is "ymd" here? – ADEN Aug 31 '22 at 23:52
  • It's a function from `lubridate` package that parses date in the following character format `yyyy-mm-dd` into Date or POSIXct objects. Depending on how the components are set we also have 'dmy` and `mdy`. – Anoushiravan R Sep 01 '22 at 08:28
1

I combined Anoushiravan's solution with this

How do I determine in R if a date interval overlaps another date interval for the same individual in a data frame?

and I think it is working now.

df %>%
  mutate(id = row_number(), days = as.numeric(as.Date(time_right) -  as.Date(time_left)) ) %>%
  mutate(Int = interval(time_left, time_right), 
         within = map(seq_along(Int), function(x){
           y = setdiff(seq_along(Int), x)
           if(any(id[which((Int[x] %within% Int[y]))+1])){
              return(id[days == max(days[which((Int[x] %within% Int[y]))+1])])
           }else{ return(0)}
         })
  ) %>% 
  mutate(within = ifelse(within > 0 , within, id)) %>% 
  group_by(within) %>% 
  summarise(time_left = min(time_left), time_right = max(time_right), x = max(x)) %>% 
  select(!within)

But it still has some bugs. for the following df, this code will not work unless I change the order of the records.

df = data.frame(time_left = c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26"),
     time_right = c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31"),
     x = c(10,100,200,20))
ADEN
  • 99
  • 4