How to compare rows in a grouped tibble with different methods for the first and the last row of the every group? (In R)

Question

this is my first question here. So please be merciful with me :-). I am grateful for any suggestions for improvement.

Here is my question:

I have a grouped and sorted tibble and I want to go through row by row and make comparisons with both the previous row and the next row. For this I have to treat the first and the last row differently. Each row has an interval defined by "start" and "end" and I want to find out if these intervals overlap.

The "if" function does not do what what I want it to do. I have tried a lot of things but I can't get a result. Maybe there is a much simpler solution.

Thanks in advance for helping!

library(tidyverse)
library(lubridate)
ID <- c(1, 1, 1, 1, 2, 2, 2, 2)
start <- ymd_hms(c("2022-04-15 10:10:00", "2022-04-15 10:15:00", "2022-04-15 10:35:00", "2022-04-15 10:50:00", "2022-04-15 11:20:00", "2022-04-15 11:35:00", "2022-04-15 11:45:00", "2022-04-15 11:50:00"))
end   <- ymd_hms(c("2022-04-15 10:20:00", "2022-04-15 10:30:00", "2022-04-15 10:40:00", "2022-04-15 10:55:00", "2022-04-15 11:30:00", "2022-04-15 11:40:00", "2022-04-15 11:55:00", "2022-04-15 11:55:00"))

data <- tibble(ID, start, end)

data %>% 
  group_by(ID) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(overlap = {
    if (row_number()==1) {lead(start) < end
    } else if (row_number()==n()) {start < lag(end)
    } else {(lead(start) < end) | (start < lag(end))}
  }) 


# A tibble: 8 x 4
# Groups:   ID [2]
     ID start               end                 overlap
  <dbl> <dttm>              <dttm>              <lgl>  
1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 FALSE  
3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 NA     
5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 NA

My expected output would be:

# A tibble: 8 x 4
# Groups:   ID [2]
ID start               end                 overlap
<dbl> <dttm>              <dttm>              <lgl>  
1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE  
3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE     
5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE

FYI, using `if` in this way is wrong, see https://stackoverflow.com/q/14170778/3358272, https://stackoverflow.com/q/10374932/3358272 — r2evans, Apr 21 '22 at 13:27

score 1 · Answer 1 · answered Apr 21 '22 at 13:36

if requires that its condition argument be exactly length 1, which is rarely the case in mutate, and especially here since we know we're dealing with more than 1 row at a time. Alternatives often include base::ifelse or dplyr::if_else or similar, though they are not quite right here.

I suggest since you want to look at previous and next rows, we should use dplyr's lag and lead functions, respectively. Since we need to treat the first/last rows specially, we can use vectorized boolean logic with & and |.

For dealing with the first/last rows specially, we have two options:

comparing row_number() with 1 and n() for the first and last rows, respectively:

data %>%
  group_by(ID) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(
    overlap = (row_number() > 1 & start < lag(end)) |
                (row_number() < n() & end > lead(start))
  ) %>%
  ungroup()
# # A tibble: 8 x 4
#      ID start               end                 overlap
#   <dbl> <dttm>              <dttm>              <lgl>  
# 1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
# 2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE   
# 3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
# 4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE  
# 5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
# 6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
# 7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
# 8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE

set default= in lag/lead such that we get the desired intent.

data %>%
  group_by(ID) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(
    overlap = (start < lag(end, default = start[1] - 1)) |
                (end > lead(start, default = end[n()] + 1))
  ) %>%
  ungroup()
# # A tibble: 8 x 4
#      ID start               end                 overlap
#   <dbl> <dttm>              <dttm>              <lgl>  
# 1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
# 2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE   
# 3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
# 4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE  
# 5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
# 6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
# 7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
# 8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE

Both methods are using strict-inequality (>/<), though you can change to >= (etc) just as easily depending on your needs.

score 1 · Answer 2 · answered Apr 21 '22 at 14:27

Here is another strategy using lubridates %within% and interval functions:

library(dplyr)
library(lubridate)

data %>% 
  group_by(ID) %>% 
  mutate(x = interval(start, end),
         y = start %within% lag(x),
         z = lead(y),
         overlap = case_when(y==TRUE | z==TRUE ~TRUE,
                              TRUE ~ FALSE)) %>% 
  select(-c(x,y,z))

     ID start               end                 overlap
  <dbl> <dttm>              <dttm>              <lgl>  
1     1 2022-04-15 10:10:00 2022-04-15 10:20:00 TRUE   
2     1 2022-04-15 10:15:00 2022-04-15 10:30:00 TRUE   
3     1 2022-04-15 10:35:00 2022-04-15 10:40:00 FALSE  
4     1 2022-04-15 10:50:00 2022-04-15 10:55:00 FALSE  
5     2 2022-04-15 11:20:00 2022-04-15 11:30:00 FALSE  
6     2 2022-04-15 11:35:00 2022-04-15 11:40:00 FALSE  
7     2 2022-04-15 11:45:00 2022-04-15 11:55:00 TRUE   
8     2 2022-04-15 11:50:00 2022-04-15 11:55:00 TRUE

@TarJar I have also made some attempts with "lubridate::interval". I have always tried to compare the overlap of an interval with all intervals in the group. which also has its advantages. For example, if interval 1 has an overlap with interval 2, but interval 2 is much shorter, so that interval 3 possibly overlaps with interval 1 but not with interval 2, then "False" is returned for interval 3. This case should be very rare in my data set, but it is not impossible. This possibility became clear to me only now. So how could I compare each interval with all intervals of the group? — Rory, Apr 25 '22 at 12:38

How to compare rows in a grouped tibble with different methods for the first and the last row of the every group? (In R)

2 Answers2