1

Consider the following data:

df <- structure(list(date = structure(c(10904, 10613, 10801, 10849, 
10740, 10680, 10780, 10909, 10750, 10814), class = "Date"), group = c(1L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA, 
-10L))

which gives:

         date group
1  1999-11-09     1
2  1999-01-22     2
3  1999-07-29     1
4  1999-09-15     2
5  1999-05-29     1
6  1999-03-30     1
7  1999-07-08     1
8  1999-11-14     2
9  1999-06-08     2
10 1999-08-11     2

I now want to calculate

  • a) the months that have been passed between two neighbouring dates per group (I know how to do this)
  • b) flag rows when a certain time period (3 months) has passed and if it has passed, i would kind of reset and again look when 3 months have been passed from that date.

So for a) I'm doing this:

library(tidyverse)
library(lubridate)
df %>%
  group_by(group) %>%
  arrange(group, date) %>%
  mutate(months_passed = time_length(interval(lag(date), date), "months"))

which gives:

# A tibble: 10 x 3
# Groups:   group [2]
   date       group months_passed
   <date>     <int>         <dbl>
 1 1999-03-30     1        NA    
 2 1999-05-29     1         1.97 
 3 1999-07-08     1         1.3  
 4 1999-07-29     1         0.677
 5 1999-11-09     1         3.35 
 6 1999-01-22     2        NA    
 7 1999-06-08     2         4.55 
 8 1999-08-11     2         2.10 
 9 1999-09-15     2         1.13 
10 1999-11-14     2         1.97

But for b) I'm lost. What I want to do is:

  • Look at each group separately.
  • Calculate the months_passed between row 1 and row 2 (here: 1.97 months for group 1)
  • If it is < 3 months, continue with the next row and calculate the time difference between the already passed months and the current time difference (here: 1.97 + 1.3 months).
  • Now that the difference is bigger >= 3 months, I want to flag row 3.
  • Now I reset the cumulative time difference and again start calculating the difference with the next row (here: 0.67 months) and so on.

Expected outcome would be:

# A tibble: 10 x 4
# Groups:   group [2]
   date       group months_passed time_flag
   <date>     <int>         <dbl>     <int>
 1 1999-03-30     1        NA             0
 2 1999-05-29     1         1.97          0
 3 1999-07-08     1         1.3           1
 4 1999-07-29     1         0.677         0
 5 1999-11-09     1         3.35          1
 6 1999-01-22     2        NA             0
 7 1999-06-08     2         4.55          1
 8 1999-08-11     2         2.10          0
 9 1999-09-15     2         1.13          1
10 1999-11-14     2         1.97          0

Any ideas?

deschen
  • 10,012
  • 3
  • 27
  • 50

2 Answers2

1

You can write a helper function :

assign_1 <- function(x) {
  y <- numeric(length(x))
  sum <- 0
  for(i in seq_along(x)) {
    sum <- sum + x[i]
    if(sum >= 3) {
      y[i] <- 1
      sum <- 0
    }
  }
  y
}

and use it in your already existing pipe :

library(dplyr)
library(lubridate)

df %>%
  group_by(group) %>%
  arrange(group, date) %>%
  mutate(months_passed = time_length(interval(lag(date, default = first(date)), 
                                              date), "months"), 
         time_flag = assign_1(months_passed)) %>%
   ungroup

#    date       group months_passed time_flag
#   <date>     <int>         <dbl>     <dbl>
# 1 1999-03-30     1         0             0
# 2 1999-05-29     1         1.97          0
# 3 1999-07-08     1         1.3           1
# 4 1999-07-29     1         0.677         0
# 5 1999-11-09     1         3.35          1
# 6 1999-01-22     2         0             0
# 7 1999-06-08     2         4.55          1
# 8 1999-08-11     2         2.10          0
# 9 1999-09-15     2         1.13          1
#10 1999-11-14     2         1.97          0
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

BTW, with some better searching (now that I know which terms to search for), I was able to bring up this one completely working without additional functions:

dplyr / R cumulative sum with reset

df %>%
  group_by(group) %>%
  arrange(group, date) %>%
  mutate(months_passed = time_length(interval(lag(date), date), "months"),
         months_passed = if_else(is.na(months_passed), 0, months_passed),
         time_flag = if_else(accumulate(months_passed, ~if_else(.x >= 3, .y, .x + .y)) >= 3, 1, 0))

# A tibble: 10 x 4
# Groups:   group [2]
   date       group months_passed time_flag
   <date>     <int>         <dbl>     <dbl>
 1 1999-03-30     1         0             0
 2 1999-05-29     1         1.97          0
 3 1999-07-08     1         1.3           1
 4 1999-07-29     1         0.677         0
 5 1999-11-09     1         3.35          1
 6 1999-01-22     2         0             0
 7 1999-06-08     2         4.55          1
 8 1999-08-11     2         2.10          0
 9 1999-09-15     2         1.13          1
10 1999-11-14     2         1.97          0
deschen
  • 10,012
  • 3
  • 27
  • 50