0

I am currently working with a data set in R that looks somewhat like the following (except with millions of pids and observations):

id  agedays  diar
1    1        1
1    2        0
1    3        1
1    4        1
1    5        0
1    6        0
1    7        NA
1    8        1
1    9        1
1    10       1
3    2        0
3    5        0
3    6        0
3    8        1
3    9        1
4    1        0
4    4        NA
4    5        0
4    6        1
4    7        0

I need to create a rolling sum on diar based on increments of agedays values. I want to create a variable that will hold the sum of diar 5 days back for each row of data. The variable will be called diar_prev5. The data set should look like the following :

id  agedays  diar  diar_prev5
1    1        1      NA
1    2        0      NA
1    3        1      NA
1    4        1      NA
1    5        0      3
1    6        0      2
1    7        NA     2
1    8        1      2
1    9        1      2
1    10       1      3
3    2        0      NA
3    5        0      0
3    6        0      0
3    8        1      1
3    9        1      2
4    1        0      NA
4    4       NA      NA
4    5        0      0
4    6        1      1
4    7        0      1

As shown above, the rolling sum should include the current agedays value and if some values in between the current row and the 4 days back contain NA values, the rolling sum should ignore these and still count the obs. in between (if there are any). I had tried both roll_sum and rollsum functions to achieve this request, but found that the function did not work if the agedays column contained gaps. When gaps occurred, the rolling sum would just contain an NA value as opposed to calculating the values in between the gaps. The functions also don't seem to include the present value of agedays in the rolling sum calculation, so I previously had to go back in and manually add this.

The previous code I used pertaining to roll_sum that did not work is seen below:

DT[, diar_prev5 := roll_sum(lag(diar, 1L), n=4L, fill=NA, align = "right"), by=id]

My question now, is how can I create a custom function to achieve the above that will include the current value of diar in the calculation and won't have issues with gaps on agedays values?

I've tried the following- but the variable results with only 0's and doesn't seem to work properly:

f = function(id_input, ageday_input) { 
  startday = ageday_input
  endday = ageday_input- 13
  sum((MPC_anthro %>% filter(id == id_input & agedays <= startday & startday <= endday))$diar) }

f = Vectorize(f)

MPC_anthro_1<-MPC_anthro %>% mutate(diar_prev5 = f(id, agedays))
bziggy
  • 463
  • 5
  • 19
  • One possibility would be to start here: [Fastest way to add rows for missing time steps?](https://stackoverflow.com/a/10473931/1851712), then apply `data.table::frollsum` (unclear what `roll_sum` is). Cheers. – Henrik Jul 23 '20 at 19:07
  • @Henrik thanks for the suggestion! I have tried filling in the days but since the file is very large, it is wildly inefficient and takes quite a while to complete. – bziggy Jul 23 '20 at 19:10
  • why is diar_prev5 = 0 on id = 3 / ageday = 5 : there are only 4 days with known values as the series start with 2. According to your calculation for id 1, shouldn't this be 0? – Waldi Jul 23 '20 at 21:59
  • @Waldi it should be summing up the value of ageday==2 and ageday==5, so NA + 0. Therefore diar_prev5 should be 0. I probably didn't explain this very well, apologies. It says diar_prev5, but it should be counting the current diar value and then incrementing back 4 levels and summing up accordingly. So in total 5 days of diar are summed. – bziggy Jul 23 '20 at 22:14
  • 1
    OK, thanks, see my edit : even simpler – Waldi Jul 23 '20 at 22:22
  • @Waldi Solved!!! :) – bziggy Jul 23 '20 at 22:32
  • OK, perfect! The solution is also not limited to agedays <= 13 – Waldi Jul 23 '20 at 22:33
  • yeah, should have used something like `.(agedays=1:max(agedays))` – DaveTurek Jul 23 '20 at 22:43

3 Answers3

1

I suspected data.table should be fast to fill in the missing agedays, even for a large dataset. So I built one with a million rows and tried the approach shown in Filling missing dates by group.

In the link @Henrik gave, you would find that he also gave this link.

library(data.table)
set.seed(2345)
DT <- data.table(
 id=rep(1:100000,each=10),
 agedays=unlist(lapply(1:100000,function(x) sort(sample(1:13,10,replace=FALSE)))),
 diar=sample(c(0,1,NA),1e6,replace=TRUE)
 )
 
 DT1 <- DT[,.(agedays=1:max(agedays)),by=.(id)]

 DT1[,diar:=NA_integer_][DT,diar:=i.diar,on=c("id","agedays")]
 
 DT1[,diar_prev5 := frollsum(diar,5,na.rm=TRUE),by=id]

 DT2 <- DT1[DT,.(id,agedays,diar,diar_prev5),on=c("id","agedays")]
  
 head(DT2,25)

giving

        id agedays diar diar_prev5
 1:  1       1    0         NA
 2:  1       2   NA         NA
 3:  1       3    0         NA
 4:  1       4    1         NA
 5:  1       7    1          2
 6:  1       8    1          3
 7:  1      10   NA          2
 8:  1      11    1          3
 9:  1      12   NA          2
10:  1      13    1          2
11:  2       1    1         NA
12:  2       3   NA         NA
13:  2       4    0         NA
14:  2       5    0          1
15:  2       6   NA          0
16:  2       7    1          1
17:  2       8    1          2
18:  2       9   NA          2
19:  2      10    1          3
20:  2      12    1          3
21:  3       2    1         NA
22:  3       3   NA         NA
23:  3       4   NA         NA
24:  3       6    1          2
25:  3       7    0          1
    id agedays diar diar_prev5

It didn't take too long to run. Is this what you are looking for?

DaveTurek
  • 1,297
  • 7
  • 8
1

A data.table solution using overlapping intervals instead of filling missing values :

DT[  ,.(id,start=agedays-4L,stop=agedays,agedays,diar)][
   DT,on=.(id=id,stop >= agedays,start <= agedays),.(id,agedays,diar),allow.cartesian=T][
     ,.(diar_prev5 = sum(diar,na.rm=T)),by = .(id,agedays)][
     ,.(id,agedays, diar_prev5 = ifelse(agedays>=5,diar_prev5,NA))]

   id agedays diar_prev5
 1:  1       1         NA
 2:  1       2         NA
 3:  1       3         NA
 4:  1       4         NA
 5:  1       5          3
 6:  1       6          2
 7:  1       7          2
 8:  1       8          2
 9:  1       9          2
10:  1      10          3
11:  3       2         NA
12:  3       5          0
13:  3       6          0
14:  3       8          1
15:  3       9          2
16:  4       1         NA
17:  4       4         NA
18:  4       5          0
19:  4       6          1
20:  4       7          1

The logic is better explained by running the two first steps separately:

DT[  ,.(id,start=agedays-4L,stop=agedays,agedays,diar)][
  DT,on=.(id=id,stop >= agedays,start <= agedays),.(id,x.start,x.stop,x.agedays,i.agedays,diar),allow.cartesian=T][order(id,x.start,i.agedays)]

    id x.start x.stop x.agedays agedays i.agedays diar
 1:  1      -3      1         1       1         1    1
 2:  1      -2      2         2       2         1    0
 3:  1      -2      2         2       2         2    0
 4:  1      -1      3         3       3         1    1
 5:  1      -1      3         3       3         2    1
 6:  1      -1      3         3       3         3    1
 7:  1       0      4         4       4         1    1
 8:  1       0      4         4       4         2    1
 9:  1       0      4         4       4         3    1
10:  1       0      4         4       4         4    1
11:  1       1      5         5       5         1    0
12:  1       1      5         5       5         2    0
13:  1       1      5         5       5         3    0
14:  1       1      5         5       5         4    0
15:  1       1      5         5       5         5    0
  1. We create an observation window [start = agedays-4,stop = agedays]
  2. We join the table to the observation window so that for each ageday we get all i.agedays and diar in the observation window
    In a data.table join, the terms coming from the left side of the join are prefixed by x. if the same name exists on the right side. The terms from the right side are prefixed by .i
  3. the next steps just sum up all the rows in the observation window for each ageday
Waldi
  • 39,242
  • 6
  • 30
  • 78
  • You might add the `diar` column to the output to make it easier to see how it does the rolling sum. – DaveTurek Jul 23 '20 at 22:46
  • @Dave, I agree, the way it works isn't easy to visualize : see my edit – Waldi Jul 23 '20 at 23:24
  • @ Waldi that is helpful. What I meant to suggest was that you add `diar` to the final output, like the OP did. That way it is easy to see that `diar_prev5` is in fact the correct rolling sum. – DaveTurek Jul 24 '20 at 00:04
1

Here is another option using rolling join:

n <- 5L
DT[, c("ndaysago", "val") := .(agedays - n + 1L, fcoalesce(diar, 0L))]
DT[, cs := cumsum(val), id]

DT[, diar_prev := DT[DT, on=.(id, agedays=ndaysago), roll=-n, i.cs - x.cs + x.val]]
DT[agedays - n < 0L, diar_prev := NA_integer_]

output:

    id agedays diar diar_prev5 ndaysago val cs diar_prev
 1:  1       1    1         NA       -3   1  1        NA
 2:  1       2    0         NA       -2   0  1        NA
 3:  1       3    1         NA       -1   1  2        NA
 4:  1       4    1         NA        0   1  3        NA
 5:  1       5    0          3        1   0  3         3
 6:  1       6    0          2        2   0  3         2
 7:  1       7   NA          2        3   0  3         2
 8:  1       8    1          2        4   1  4         2
 9:  1       9    1          2        5   1  5         2
10:  1      10    1          3        6   1  6         3
11:  3       2    0         NA       -2   0  0        NA
12:  3       5    0          0        1   0  0         0
13:  3       6    0          0        2   0  0         0
14:  3       8    1          1        4   1  1         1
15:  3       9    1          2        5   1  2         2
16:  4       1    0         NA       -3   0  0        NA
17:  4       4   NA         NA        0   0  0        NA
18:  4       5    0          0        1   0  0         0
19:  4       6    1          1        2   1  1         1
20:  4       7    0          1        3   0  1         1
21:  5       1    1         NA       -3   1  1        NA
22:  5       6    2          2        2   2  3         2
23:  5      10    3          5        6   3  6         5
24:  5      15    4          4       11   4 10         4
    id agedays diar diar_prev5 ndaysago val cs diar_prev

data with one more id:

DT <- fread("id  agedays  diar  diar_prev5
1    1        1      NA
1    2        0      NA
1    3        1      NA
1    4        1      NA
1    5        0      3
1    6        0      2
1    7        NA     2
1    8        1      2
1    9        1      2
1    10       1      3
3    2        0      NA
3    5        0      0
3    6        0      0
3    8        1      1
3    9        1      2
4    1        0      NA
4    4       NA      NA
4    5        0      0
4    6        1      1
4    7        0      1
5 1 1 NA
5 6 2 2
5 10 3 5
5 15 4 4")

Would be interested to know the runtime on your actual dataset.

chinsoon12
  • 25,005
  • 4
  • 25
  • 35