dplyr or data.table to calculate time series aggregations in R

Question

I'm trying to summarize a data.frame which contains date (or time) information.

Let's suppose this one containing hospitalization records by patient:

df <- data.frame(c(1, 2, 1, 1, 2, 2),
             c(as.Date("2013/10/15"), as.Date("2014/10/15"), as.Date("2015/7/16"), as.Date("2016/1/7"), as.Date("2015/12/20"), as.Date("2015/12/25")))
names(df) <- c("patient.id", "hospitalization.date")

df looks like this:

> df
      patient.id hospitalization.date
    1          1           2013-10-15
    2          2           2014-10-15
    3          1           2015-07-16
    4          1           2016-01-07
    5          2           2015-12-20
    6          2           2015-12-25

For each observation, I need to count the number of hospitalizations occuring in the 365 days before that hospitalization.

In my example it would be the new df$hospitalizations.last.year column.

> df
      patient.id hospitalization.date hospitalizations.last.year
    1          1           2013-10-15                          1
    2          2           2014-10-15                          1
    3          1           2015-07-16                          1
    4          2           2015-12-20                          1
    5          2           2015-12-25                          2
    6          1           2016-01-07                          2
    7          2           2016-02-10                          3

Note that the counter is including the number of previous records in the last 365 days, not only in the current year.

I'm trying to do that using dplyr or data.table because my dataset is huge and performance matters. ¿Is it possible?

According to this logic, shouldn't the third observation also be counted as 'hospitalization last year'? — Heroka, Feb 22 '16 at 13:53
I'm trying to compute the third column `hospitalizations.last.year` on a per patient basis. So for the sixth row, the patient 2 has been hospitalized 2 times after 2014-12-25, one in 2015-12-25 and 2015-12-20. The other one in 2014-10-15 is outside the 365 days window. — Javi Rodríguez, Feb 22 '16 at 14:02
Still unclear to me, sorry. Just to check: for each observation, you want to count the number of hospitalizations occuring in the 365 days before _that_ hospitalization. (Thanks @docendodiscimus) — Heroka, Feb 22 '16 at 14:04
Seems like a standard rolling join problem, but your last column makes no sense to me. Why is e.g. current hospitalization date not counted for first row, but counted for the last one..? — eddi, Feb 22 '16 at 20:26
Sorry. Last column was wrong. `hospitalizations.last.year` now counts how many times the `patient.id` has been in hospital during last year. — Javi Rodríguez, Feb 22 '16 at 21:33
Have a look in [here](http://stackoverflow.com/questions/41615967/r-calculate-the-number-of-occurrences-of-a-specific-event-in-the-past-and-futur) for inspiration, looks like something that should help you — Kasia Kulma, Jan 17 '17 at 16:36

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

Since version 1.9.8 (on CRAN 25 Nov 2016), data.table offers non-equi joins:

library(data.table)
# coerce to data.table
setDT(df)[
  # create helper column
  , date_365 := hospitalization.date - 365][
    # step1: non-equi self-join
    df, on = c("patient.id", "hospitalization.date>=date_365", 
               "hospitalization.date<=hospitalization.date")][
      # step 2: count hospitalizations.last.year for each patient
      , .(hospitalizations.last.year = .N), 
      by = .(patient.id, hospitalization.date = hospitalization.date.1)]

   patient.id hospitalization.date hospitalizations.last.year
1:          1           2013-10-15                          1
2:          2           2014-10-15                          1
3:          1           2015-07-16                          1
4:          2           2015-12-20                          1
5:          2           2015-12-25                          2
6:          1           2016-01-07                          2
7:          2           2016-02-10                          3

Edit: Join and aggregation can be combined in one step:

# coerce to data.table
setDT(df)[
  # create helper column
  , date_365 := hospitalization.date - 365][
    # non-equi self-join
    df, on = c("patient.id", "hospitalization.date>=date_365", 
               "hospitalization.date<=hospitalization.date"), 
    # count hospitalizations.last.year grouped by join parameters
    .(hospitalizations.last.year = .N), by = .EACHI][
      # remove duplicate column
      , hospitalization.date := NULL][]

The result is the same as above.

Data

The OP has provided two data sets with 6 and 7 rows, resp. Here, the data set with 7 rows is used as it was posted as expected result:

df <- data.frame(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16", 
                                   "2016/1/7", "2015/12/20", "2015/12/25", "2016/2/10")))
df <- df[order(df$hospitalization.date), ]

dplyr or data.table to calculate time series aggregations in R

1 Answers1

Data

Linked