
I want to calculate, for each row, the count of preceding rows that fall within the previous 1-year window, by id.

Here's my data:

df <- structure(list(
  id   = c("1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "2", "2"),
  flag = c(1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1),
  date = structure(c(15425, 15456, 16613, 16959, 15513, 15513, 15625,
                     15635, 15649, 15663, 15670, 16051, 16052),
                   class = "Date")),
  sorted = "id",
  class = c("data.table", "data.frame"),
  row.names = c(NA, -13L))




roll_sum <- c(0, 1, 0, 1, 0, 1, 2, 3, 4, 5, 6, 0, 1)
flag_sum <- c(0, 1, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 1)

df_desired <- cbind(df, roll_sum)         # roll_sum: rows (excluding the current row) in the rolling 1-year window
df_desired <- cbind(df_desired, flag_sum) # flag_sum: rows in the same window where flag was 1
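To make the target concrete, here is a minimal brute-force sketch (O(n²) per id, fine only for small data; it restates the sample data above) that reproduces both columns. The window used is 366 days, matching the desired output:

```r
df <- data.frame(
  id   = c("1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "2", "2"),
  flag = c(1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  date = as.Date(c(15425, 15456, 16613, 16959, 15513, 15513, 15625,
                   15635, 15649, 15663, 15670, 16051, 16052),
                 origin = "1970-01-01")
)

n <- nrow(df)
roll_sum <- flag_sum <- integer(n)
for (i in seq_len(n)) {
  # earlier rows of the same id whose date lies in (date - 366, date]
  prev   <- seq_len(i - 1L)
  in_win <- prev[df$id[prev] == df$id[i] &
                 df$date[prev] >= df$date[i] - 366 &
                 df$date[prev] <= df$date[i]]
  roll_sum[i] <- length(in_win)        # all rows in the window
  flag_sum[i] <- sum(df$flag[in_win])  # only rows where flag == 1
}
```

This is just a reference implementation of the definition, not a scalable solution; the accepted answer below does the same thing with a non-equi join.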

Data:

id flag       date
 1:  1    1 2012-03-26
 2:  1    1 2012-04-26
 3:  1    0 2015-06-27
 4:  1    1 2016-06-07
 5:  2    0 2012-06-22
 6:  2    0 2012-06-22
 7:  2    1 2012-10-12
 8:  2    1 2012-10-22
 9:  2    1 2012-11-05
10:  2    1 2012-11-19
11:  2    1 2012-11-26
12:  2    1 2013-12-12
13:  2    1 2013-12-13

Output:

df_desired
    id flag       date roll_sum flag_sum
 1:  1    1 2012-03-26        0        0
 2:  1    1 2012-04-26        1        1
 3:  1    0 2015-06-27        0        0
 4:  1    1 2016-06-07        1        0
 5:  2    0 2012-06-22        0        0
 6:  2    0 2012-06-22        1        0
 7:  2    1 2012-10-12        2        0
 8:  2    1 2012-10-22        3        1
 9:  2    1 2012-11-05        4        2
10:  2    1 2012-11-19        5        3
11:  2    1 2012-11-26        6        4
12:  2    1 2013-12-12        0        0
13:  2    1 2013-12-13        1        1

I tried the solution given by G. Grothendieck using zoo in Compute rolling sum by id variables, with missing timepoints, but it gives me an error:

Error in merge.zoo(z, g) :
  series cannot be merged with non-unique index entries in a series
In addition: Warning message:
In zoo(count, date) :

I also made the date column unique using `make.index.unique` and `make.time.unique`.

Any help with an optimized solution would be appreciated. Thanks.

sm925
  • I haven't tried this, but by its name the `uts` package should do what you need: https://github.com/andreas50/uts — otherwise you would need to insert NA rows for every date (and probably `id`) and then do a regular rolling sum. A fast rolling sum (`frollsum`) is planned for the data.table 1.12.0 release. – jangorecki Jun 20 '18 at 16:30
  • I had a look at `uts`, but couldn't really figure out how to solve my problem with it. – sm925 Jun 21 '18 at 14:01

1 Answer


Not sure this will scale to the dimensions of your data.

First, create a running index so that duplicate dates are handled correctly (the rolling sum must only count strictly earlier rows, even on the same date), and create a column holding the date one year ago (I would argue that 365 is better, but it seems the OP wants 366).

Then, perform a non-equi self-join, ensuring that only strictly earlier rows are counted and that their dates lie within one year of the current row.

df[, c("rn", "oneYrAgo") := .(.I, date - 366)]

df[df,
   .(roll_sum = .N, flag_sum = sum(flag, na.rm = TRUE)),
   on = .(date >= oneYrAgo, rn < rn, id, date <= date),
   by = .EACHI][, -seq_len(2L)]

Result:

    id       date roll_sum flag_sum
 1:  1 2012-03-26        0        0
 2:  1 2012-04-26        1        1
 3:  1 2015-06-27        0        0
 4:  1 2016-06-07        1        0
 5:  2 2012-06-22        0        0
 6:  2 2012-06-22        1        0
 7:  2 2012-10-12        2        0
 8:  2 2012-10-22        3        1
 9:  2 2012-11-05        4        2
10:  2 2012-11-19        5        3
11:  2 2012-11-26        6        4
12:  2 2013-12-12        0        0
13:  2 2013-12-13        1        1
chinsoon12
  • It worked fine on my data set. The data set I'm working on is not that big, around 75 MB. I agree with you that 365 days is a better option; it was something flexible. I have two questions regarding this solution: 1. How do I retain all the columns after the second line of code? I can join `roll_sum` and `flag_sum` to the original data set by `rn`, but it would be better to retain all the columns. 2. Duplicate columns with the same name `date` will be a problem; I would have to filter out the column holding the date one year back. Otherwise, I really like your solution — it works perfectly. – sm925 Jun 21 '18 at 13:51
  • All columns, as in you have many more columns than those in the sample dataset you posted? Anyway, I posted a new version where there are no additional columns – chinsoon12 Jun 21 '18 at 22:23
  • Otherwise, you can do a left join to update as you mentioned, or do 2 update joins – chinsoon12 Jun 21 '18 at 22:29
  • Yes, it was a subset of the columns that I posted here; there are many more columns. But it's better when there are no duplicated `date` columns in the output. For retaining all the columns, as you mentioned, I can do a `left_join`. That would work. Thanks for your help. – sm925 Jun 22 '18 at 14:25
  • There are dupe joins in your actual dataset. You need to add `allow.cartesian=TRUE` to the code – chinsoon12 Jun 22 '18 at 23:26
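Following up on the column-retention discussion in the comments, one possible sketch (restating the sample data, and using the answer's join unchanged) keeps every original column by writing the two aggregates back into `df` by reference. Since `by=.EACHI` emits one result row per row of `i` in order, `res` aligns row-for-row with `df`:

```r
library(data.table)

df <- data.table(
  id   = c("1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "2", "2"),
  flag = c(1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  date = as.Date(c(15425, 15456, 16613, 16959, 15513, 15513, 15625,
                   15635, 15649, 15663, 15670, 16051, 16052),
                 origin = "1970-01-01")
)
df[, c("rn", "oneYrAgo") := .(.I, date - 366)]

# the answer's non-equi self-join; one aggregate row per row of i, in i's order
res <- df[df,
          .(roll_sum = .N, flag_sum = sum(flag, na.rm = TRUE)),
          on = .(date >= oneYrAgo, rn < rn, id, date <= date),
          by = .EACHI]

# copy the aggregates back into df by reference, then drop the helper columns,
# so all original columns survive and no duplicate date column appears
df[, c("roll_sum", "flag_sum") := res[, .(roll_sum, flag_sum)]]
df[, c("rn", "oneYrAgo") := NULL]
```

As noted in the last comment, on real data with many duplicate matches the join may additionally need `allow.cartesian=TRUE`.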