Time series summary

Question

I was trying to sum numbers whose time lag is 1. i.e. I would like to summarize the rows by adding the frequencies of values where the days differ only by a single day within a particular group. I used the lag function to get the diff, but not sure how to proceed from here.

df <- df %>% 
  group_by(group) %>% 
  mutate(diff = dt - lag(dt))

df[!is.na(df$diff) & df$diff > 1,]$diff <- NA

For ex:

 group     dt           freq  diff  
 groupA    2016-03-21    1     NA    
 groupA    2016-03-22    1     1     
 groupA    2016-03-23    1     1     
 groupA    2016-03-26    2     NA     
 groupA    2016-03-28    1     NA     
 groupA    2016-03-29    3     1     
 groupA    2016-03-30    3     1     
 groupA    2016-03-31    5     1     
 groupB    2016-04-01    1     NA      
 groupB    2016-04-02    2     1

I need to group this into:

group    dt         freq  diff  duration     
groupA  2016-03-21    1     NA    3 (1 + 1 + 1)     
groupA  2016-03-22    1     1         
groupA  2016-03-23    1     1         
groupA  2016-03-26    2     NA    2     
groupA  2016-03-28    1     NA    12(1 + 3 + 3 + 5)     
groupA  2016-03-29    3     1         
groupA  2016-03-30    3     1         
groupA  2016-03-31    5     1         
groupB  2016-04-01    1     NA    3(1 + 2)     
groupB  2016-04-02    2     1

Also referred to this, but cumulative does not work as I do not consider jumps more than a single day apart. Is looping in a custom function the only way?

M-- · Answer 1 · 2020-10-09T02:45:31.063

You can do it much easier with this approach (grouping rows with less.than 1 day difference); this will create a helper column gap which later will be used to sum the freq for consecutive days in the same group:

library(dplyr)

df %>% 
    mutate(gap = cumsum(!c(TRUE, diff(as.Date(df$dt)) == 1)))  %>% 
    group_by(gap, group) %>% 
    mutate(duration = sum(freq, na.rm=TRUE)) %>% 
    ungroup %>% select(-gap) %>% as.data.frame

#            group         dt freq duration
#        1  groupA 2016-03-21    1        3
#        2  groupA 2016-03-22    1        3
#        3  groupA 2016-03-23    1        3
#        4  groupA 2016-03-26    2        2
#        5  groupA 2016-03-28    1       12
#        6  groupA 2016-03-29    3       12
#        7  groupA 2016-03-30    3       12
#        8  groupA 2016-03-31    5       12
#        9  groupB 2016-04-01    1        3
#        10 groupB 2016-04-02    2        3

Maurits Evers · Accepted Answer · 2018-04-10T02:26:37.210

Here is a tidyverse solution using dplyr::lead:

library(tidyverse);
df %>%
    mutate(dt = as.POSIXct(dt)) %>%
    group_by(group) %>%
    mutate(
        diff = pmin(c(1, diff(dt)), c(1, diff(lead(dt))), na.rm = T),
        id = cumsum(c(TRUE, diff(diff) != 0) | diff > 1)) %>%
    group_by(group, id) %>%
    mutate(duration = sum(freq)) %>%
    ungroup() %>%
    select(-diff, -id)
## A tibble: 10 x 4
#   group  dt                   freq duration
#   <fct>  <dttm>              <int>    <int>
# 1 groupA 2016-03-21 00:00:00     1        3
# 2 groupA 2016-03-22 00:00:00     1        3
# 3 groupA 2016-03-23 00:00:00     1        3
# 4 groupA 2016-03-26 00:00:00     2        2
# 5 groupA 2016-03-28 00:00:00     1       12
# 6 groupA 2016-03-29 00:00:00     3       12
# 7 groupA 2016-03-30 00:00:00     3       12
# 8 groupA 2016-03-31 00:00:00     5       12
# 9 groupB 2016-04-01 00:00:00     1        3
#10 groupB 2016-04-02 00:00:00     2        3

Explanation: diff chooses the minimum difference between the preceding and following date. We then look for changes in diff, and create a new grouping vector id by which we calculate the summary metric sum(freq).

Sample data

df <- read.table(text =
    " group     dt           freq  diff
 groupA    2016-03-21    1     NA
 groupA    2016-03-22    1     1
 groupA    2016-03-23    1     1
 groupA    2016-03-26    2     NA
 groupA    2016-03-28    1     NA
 groupA    2016-03-29    3     1
 groupA    2016-03-30    3     1
 groupA    2016-03-31    5     1
 groupB    2016-04-01    1     NA
 groupB    2016-04-02    2     1 ", header = T)

Update

For your second example:

# Sample data
df <- read.table(text =
" group     dt           freq  diff
groupA    2016-03-21    1     NA
groupA    2016-03-22    1     1
groupA    2016-03-23    1     1
groupA    2016-03-26    2     NA
groupA    2016-03-28    1     NA
groupA    2016-04-01    3     1
groupA    2016-04-02    3     1
groupA    2016-04-03    5     1
groupB    2016-04-01    1     NA
groupB    2016-04-02    2     1 ", header = T)

df %>%
    mutate(dt = as.POSIXct(dt)) %>%
    group_by(group) %>%
    mutate(
        diff = pmin(c(1, diff(dt)), c(1, diff(lead(dt))), na.rm = T),
        id = cumsum(c(TRUE, diff(diff) != 0) | diff > 1)) %>%
    group_by(group, id) %>%
    mutate(duration = sum(freq)) %>%
    ungroup() %>%
    select(-diff, -id);
## A tibble: 10 x 4
#   group  dt                   freq duration
#   <fct>  <dttm>              <int>    <int>
# 1 groupA 2016-03-21 00:00:00     1        3
# 2 groupA 2016-03-22 00:00:00     1        3
# 3 groupA 2016-03-23 00:00:00     1        3
# 4 groupA 2016-03-26 00:00:00     2        2
# 5 groupA 2016-03-28 00:00:00     1        1
# 6 groupA 2016-04-01 00:00:00     3       11
# 7 groupA 2016-04-02 00:00:00     3       11
# 8 groupA 2016-04-03 00:00:00     5       11
# 9 groupB 2016-04-01 00:00:00     1        3
#10 groupB 2016-04-02 00:00:00     2        3

Thanks @Maurits. Trying to tweak this logic as unfortunately changing the dates within group A to reflect the following dates - c('2016-03-21', '2016-03-22', '2016-03-23', '2016-03-26', '2016-03-28', '2016-04-01', '2016-04-02', '2016-04-03') sums up the values of 26th and 28th together as their differences in time becomes zero. — naanan_, Apr 10 '18 at 02:09
@naanan_ Ah I see what you mean; please see my updated answer. I was missing a `diff > 1` condition in `cumsum`. — Maurits Evers, Apr 10 '18 at 02:26
thank you! A slightly different solution (probably not ideal) based on your idea. :) df <- df %>% arrange(group, dt) %>% group_by(group) %>% mutate(diff = dt - lag(dt), cumsum = cumsum(freq)) df[!is.na(df$diff) & df$diff > 1, ]$diff <- NA df[is.na(df$diff),]$diff <- 0 df$diff <- 1- (df$diff) df <- df %>% group_by(group) %>% mutate( id = cumsum(as.numeric(diff))) %>% group_by(group, id) %>% mutate(duration = sum(freq)) %>% ungroup() — naanan_, Apr 10 '18 at 02:55
You're very welcome. That's (one of) R's strength: There's always more than one solution;-) — Maurits Evers, Apr 10 '18 at 03:02

Time series summary

2 Answers2

Sample data

Update