4

I need to create a variable that counts the number of observations that have occurred in the last 30 days for each id.

For example, imagine an observation that occurs on 1/2/2021 (d / m / y) for the id "a". If this observation is the first between 1/1/2021 and 1/2/2021 for the id "a" the variable must give 1. If it is the second, 2, etc.

Here is a larger example:

dat <- tibble::tribble(
  ~id,  ~q,   ~date,
  "a",   1,   "01/01/2021",
  "a",   1,   "01/01/2021",
  "a",   1,   "21/01/2021",
  "a",   1,   "21/01/2021",
  "a",   1,   "12/02/2021",
  "a",   1,   "12/02/2021",
  "a",   1,   "12/02/2021",
  "a",   1,   "12/02/2021",
  "b",   1,   "02/02/2021",
  "b",   1,   "02/02/2021",
  "b",   1,   "22/02/2021",
  "b",   1,   "22/02/2021",
  "b",   1,   "13/03/2021",
  "b",   1,   "13/03/2021",
  "b",   1,   "13/03/2021",
  "b",   1,   "13/03/2021")
dat$date <- lubridate::dmy(dat$date)

The result should be:

id  q   date    newvar
a   1   01/01/2021  1
a   1   01/01/2021  2
a   1   21/01/2021  3
a   1   21/01/2021  4
a   1   12/02/2021  3
a   1   12/02/2021  4
a   1   12/02/2021  5
a   1   12/02/2021  6
b   1   02/02/2021  1
b   1   02/02/2021  2
b   1   22/02/2021  3
b   1   22/02/2021  4
b   1   13/03/2021  3
b   1   13/03/2021  4
b   1   13/03/2021  5
b   1   13/03/2021  6

Thank you very much.

zephryl
  • 14,633
  • 3
  • 11
  • 30
Amc
  • 131
  • 8
  • 1
    I don't quite understand the algorithm. I thought it was giving a sequence of increasing integers by id and date, but then you've got the 12/02/2021 for observations "a" going from 3-6 instead of 1-4. How do we figure out at what value the sequence is supposed to start? – DaveArmstrong Feb 17 '22 at 14:27
  • Hi, because the observations 12/02/2021 for "a" counts the number of observations in "a" between 12/01/2021 and 12/02/2021 (30 days). So, two obs in 21/01/2021 should be considered. – Amc Feb 17 '22 at 14:31

2 Answers2

5

With sapply and between, count the number of observations prior to the current observation that are within 30 days.

library(lubridate)
library(dplyr)
dat %>% 
  group_by(id) %>% 
  mutate(newvar = sapply(seq(length(date)), 
                         function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))

# A tibble: 16 x 4
# Groups:   id [2]
   id        q date       newvar
   <chr> <dbl> <date>      <int>
 1 a         1 2021-01-01      1
 2 a         1 2021-01-01      2
 3 a         1 2021-01-21      3
 4 a         1 2021-01-21      4
 5 a         1 2021-02-12      3
 6 a         1 2021-02-12      4
 7 a         1 2021-02-12      5
 8 a         1 2021-02-12      6
 9 b         1 2021-02-02      1
10 b         1 2021-02-02      2
11 b         1 2021-02-22      3
12 b         1 2021-02-22      4
13 b         1 2021-03-13      3
14 b         1 2021-03-13      4
15 b         1 2021-03-13      5
16 b         1 2021-03-13      6
Maël
  • 45,206
  • 3
  • 29
  • 67
3

Left join dat to itself on the indicated condition grouping by the rows of the left hand data frame. We assume that you want a 30 day window ending at current row but if you wanted 30 days ago (31 day window) then change 29 to 30. Both give the same result for this data.

library(sqldf)

sqldf("select a.*, count(b.date) as newvar
  from dat a left join dat b
  on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
  group by a.rowid")

giving:

   id q       date        newvar
1   a 1 2021-01-01             1
2   a 1 2021-01-01             2
3   a 1 2021-01-21             3
4   a 1 2021-01-21             4
5   a 1 2021-02-12             3
6   a 1 2021-02-12             4
7   a 1 2021-02-12             5
8   a 1 2021-02-12             6
9   b 1 2021-02-02             1
10  b 1 2021-02-02             2
11  b 1 2021-02-22             3
12  b 1 2021-02-22             4
13  b 1 2021-03-13             3
14  b 1 2021-03-13             4
15  b 1 2021-03-13             5
16  b 1 2021-03-13             6

To write it in a pipeline using [.] to denote the input data frame works.

dat %>% { 
  sqldf("select a.*, count(b.date) as newvar
    from [.] a left join [.] b
      on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
    group by a.rowid")
  }

This runs roughly twice as fast as sapply on the data in the question.

library(microbenchmark)
microbenchmark(
  sqldf = sqldf("select a.*, count(b.date) as newvar
    from dat a left join dat b
    on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
    group by a.rowid"),
  sapply = dat %>% 
    group_by(id) %>% 
    mutate(newvar = sapply(seq(length(date)), 
                         function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
)

giving:

Unit: milliseconds
   expr     min       lq     mean  median       uq      max neval cld
  sqldf 26.2768 26.77340 27.97039 27.0082 27.29515  63.1032   100  a 
 sapply 42.8800 43.69345 48.53094 44.1089 45.25275 285.4861   100   b
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341