4

I am working with a dataframe where each row is indexed by a date. These are ordered but incomplete, meaning that that there is a subset of uninterrupted sequential rows in the dataset. I am interested in grouping the rows that belong to a sequence and finding out the length (number of days) of each sequence. Rows that are not part of a sequence are disregarded (or could have length of 1). Consider the following toy data:

library(tidyverse)
library(lubridate)

# toy data
df <- tibble(day = as_date(c("2022-01-01", "2022-01-03", "2022-01-04", 
                             "2022-01-05", "2022-01-08","2022-01-10",
                             "2022-01-11", "2022-01-12","2022-01-13")),
             values = 1:9)

What I am trying to obtain is an aggregated dataframe that tells me each uninterrupted sequence of days in the data and their length such as this:

# ideal output
tibble(sequences = c("2022-01-03 - 2022-01-05", "2022-01-10 - 2022-01-12"),
       length = c(3, 4))

So far I have calculated the differences in days between each row by using lags and leads and then turned this into an additional column indicating whether a row is the start, mid or endpoint of an uninterrupted sequence of days:

# transform df
df %>% 
  # add lag and lead to find day differences
  mutate(last_day = lag(day), 
         next_day = lead(day),
         diff_last = day - last_day,
         diff_next = next_day - day) %>% 
  # flag days in sequence
  mutate(in_seq = if_else(
    condition = diff_last == 1| diff_next == 1, 
    true = TRUE, 
    false = FALSE)) %>% 
  # find position of days in sequence
  mutate(seq_position = case_when(
    diff_last > 1 & diff_next == 1 ~ "start",
    diff_last == 1 & diff_next > 1 ~ "end",
    diff_last == 1 & diff_next == 1 ~ "mid",
    is.na(diff_last) & diff_next == 1 ~ "start",
    diff_last == 1 & is.na(diff_next) ~ "end",
    is.na(diff_last) & is.na(diff_next) ~ NA_character_,
    TRUE ~ "none"
  ))

This is where I'm stuck. I suspect that I need to create an additional variable for grouping that contains the start and end of each sequence so that I can summarise the number of rows per group, but I do not know a way for creating such groups since it involves creating a string based on values from multiple columns.

O René
  • 305
  • 1
  • 12
  • 1
    Create a sequence id that increments by 1 each time `seq_position` is `"start"`. Then `group_by(sequence_id) %>% summarise()` to get the first and last dates for each `sequence_id`. everything after that is just string manipulation. – Limey Aug 02 '22 at 11:14
  • this sounds like what I vaguely had in mind with creating a grouping variable and then summarising. How would you go about creating the ID that increments whenever 'start' is present? – O René Aug 02 '22 at 11:16
  • 2
    For the grouping variable: [Create grouping variable for consecutive sequences and split vector](https://stackoverflow.com/questions/5222061/create-grouping-variable-for-consecutive-sequences-and-split-vector); i.e. `cumsum(c(1L, diff(day) != 1))`. Then find first, last, length et c by group. – Henrik Aug 02 '22 at 11:28

1 Answers1

4

You could do:

df %>% 
  group_by(cumsum(c(0, diff(day) - 1))) %>%
  summarise(sequences = paste(first(day), last(day), sep = ' - '),
            length    = n()) %>%
  filter(length > 1) %>%
  select(sequences, length)

#> # A tibble: 2 x 2
#>   sequences               length
#>   <chr>                    <int>
#> 1 2022-01-03 - 2022-01-05      3
#> 2 2022-01-10 - 2022-01-13      4
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • beautiful, thank you. What does the grouping by `cumsum`do exactly? – O René Aug 02 '22 at 12:25
  • 1
    If you do `diff(day)`, then consecutive days will be given a 1, and non-consecutive days will be given a number greater than 1. So `diff(day) - 1` will be 0 on non-consecutive days. Therefore, `cumsum(diff(day) - 1)` will give the same integer value to groups of consecutive days. You need to add a zero at the start, since a `diff` is one element longer than the original column. Grouping by this result just means you are grouping by the groups of consecutive days. – Allan Cameron Aug 02 '22 at 13:00