4

I have a dataset comprised of students (id) and the grade they where in every year:

library(data.table)
set.seed(1)
students <- data.table("id" = rep(1:10, each = 10),
                "year" = rep(2000:2009, 10),
                "grade" = sample(c(9:11, rep(NA, 5)), 100, replace = T))

Here is a sample for student 1:

     id year grade
  1:  1 2000     9
  2:  1 2001    NA
  3:  1 2002    NA
  4:  1 2003     9
  5:  1 2004    10
  6:  1 2005    NA
  7:  1 2006    NA
  8:  1 2007    11
  9:  1 2008    NA

I would like to have a way to access each students prior and future grades to preform different operations. Say for example, adding the last three grades of the student. This would result in a dataset like this one:

    id year grade sum_lag_3
 1:  1 2000     9         9 # 1st window, size 1: 9
 2:  1 2001    NA         9 
 3:  1 2002    NA         9
 4:  1 2003     9        18 # 2nd, size 2: 9 + 9 = 18 
 5:  1 2004    10        28 # 3rd, size 3: 9 + 9 + 10 = 28
 6:  1 2005    NA        28
 7:  1 2006    NA        28
 8:  1 2007    11        30 # 4th, size 3: 9 + 10 + 11 = 30 
 9:  1 2008    NA        30
10:  1 2009    10        31 # 5th, size 3: 10 + 11 + 10 = 31

11:  2 2001    11        11 # 1st window, size 1: 11 

(All results would look like this).

  • This however is NOT a post about preforming a rolling sum.
  • I want to be able to more generally preform operations within each group, to do this I would need to find a way to reference all of a students past and future grades.

So in the case of the first row, since there are no previous observations this would mean the 'past' vector is empty but the 'future' vector one would be NA NA 9 10 NA NA 11 NA 10.

Similarly, for the second row the 'past' vector would be 9 and the 'future' vector would be:

NA 9 10 NA NA 11 NA 10

And for the third row the 'past' vector would be 9 NA and the 'future' vector would be:

9 10 NA NA 11 NA 10

This is the information I want reference to make different calculations. Calculations that are only within each group and vary depending on the context. Preferably I would like to do this using data.table and without reshaping my data in to a wide format.

I've tried doing the following:

students[, .SD[, sum_last_3:= ...], by = id]

but I get an error message saying this feature is not yet available on data.table (where ... is a placeholder for any operation.).

Thank you all.


cach dies
  • 331
  • 1
  • 14
  • I think you want a rolling sum of the last 3 scores – Nate Jun 16 '20 at 21:11
  • Year and Grade need to vary together while generating data. e.g. 2000 g=9, 2001 g=10, ... They cannot be independent. The range on Year and on Grade must be the same... A 2000:2009 range would equate to ELEM PreK,K,1--8 ... OR 2001:2004 might be HS 9-12. – donPablo Jun 16 '20 at 21:14
  • @Henrik, I've added a small sample of the result – cach dies Jun 16 '20 at 21:17
  • 1
    @Nate, is more than that, I want to be able to preform operations utilizing the past and future of a student for all kinds of operations not just a sum. – cach dies Jun 16 '20 at 21:19
  • @donPablo No, the generation as presented in the post is faithful to my real-world data. – cach dies Jun 16 '20 at 21:19
  • 1
    When you have many groups, setting `setDTthreads(1)` might speed up. Or if groups are of equal length, then you can `dcast` and compute all at once, not by group, in parallel. – jangorecki Jun 17 '20 at 18:11
  • @Henrik, than you for you contributions, however this post is **not** about how to preform a rolling sum. It is instead about how to preform contextual, within-group operations. So I have edited the post to reflect that. – cach dies Jun 18 '20 at 01:50
  • Thank you for your feedback. You write: "_this post is not about how to preform a rolling sum_". However, the desired result that you _actually showed_ in your original post corresponded to a rolling sum allowing for partial windows. To which both @chinsoon12 and I provided answers. In your edit you have the variable `sum_last_3`, but in the examples it seems like there is no restriction to the window size. – Henrik Jun 18 '20 at 07:52
  • This contradicts the original example, which you still have in the question, where window size is 3... Therefore, as I wrote before, it is critical to provide an example where you provide the desired output explicitly and also describe it in words. We are not mind readers ;) Cheers – Henrik Jun 18 '20 at 07:55
  • @Henrik, the idea with the rolling sum is to illustrate a *process* rather that to achieve an outcome. Hence, why the rolling sum is, and has always been, referenced only as an example. The true objective here is to do contextualized within-group calculations. I've tried to make this as clear as possible, if that is not the case I will happily implement any changes that make it easier to understand. – cach dies Jun 18 '20 at 22:06
  • @cachdies Thanks a lot for your reply and clarification. It seems like chinsoon12's edit solves your problem. – Henrik Jun 18 '20 at 22:31

2 Answers2

4

Here is an option using frollsum in data.table by applying it on the non-NA values first before carrying last observation forward:

students[!is.na(grade), sum_lag_3 := 
    fcoalesce(frollsum(grade, 3L), as.double(cumsum(grade))), id]
students[, sum_lag_3 := nafill(sum_lag_3, "locf"), id]

output:

     id year grade sum_lag_3
  1:  1 2000     9         9
  2:  1 2001    NA         9
  3:  1 2002    NA         9
  4:  1 2003     9        18
  5:  1 2004    10        28
  6:  1 2005    NA        28
  7:  1 2006    NA        28
  8:  1 2007    11        30
  9:  1 2008    NA        30
 10:  1 2009    10        31
 11:  2 2000    11        11    <-----
 12:  2 2001    11        22
 13:  2 2002     9        31
 14:  2 2003    NA        31
 15:  2 2004    NA        31
 16:  2 2005    10        30
 17:  2 2006    NA        30
 18:  2 2007    NA        30
 19:  2 2008    10        29
 20:  2 2009    NA        29
 21:  3 2000     9         9
 22:  3 2001    NA         9
 23:  3 2002    NA         9
 24:  3 2003    NA         9
 25:  3 2004     9        18
 26:  3 2005     9        27
 27:  3 2006    NA        27
 28:  3 2007    NA        27
 29:  3 2008    NA        27
 30:  3 2009    10        28
 31:  4 2000    10        10
 32:  4 2001    NA        10
 33:  4 2002     9        19
 34:  4 2003    NA        19
 35:  4 2004    NA        19
 36:  4 2005     9        28
 37:  4 2006    NA        28
 38:  4 2007    11        29
 39:  4 2008    NA        29
 40:  4 2009    10        30
 41:  5 2000    10        10
 42:  5 2001    NA        10
 43:  5 2002    NA        10
 44:  5 2003    NA        10
 45:  5 2004    NA        10
 46:  5 2005    NA        10
 47:  5 2006    10        20
 48:  5 2007    NA        20
 49:  5 2008     9        29
 50:  5 2009    NA        29
 51:  6 2000    NA        NA
 52:  6 2001     9         9
 53:  6 2002    NA         9
 54:  6 2003    NA         9
 55:  6 2004     9        18
 56:  6 2005    NA        18
 57:  6 2006    NA        18
 58:  6 2007    NA        18
 59:  6 2008    10        28
 60:  6 2009    NA        28
 61:  7 2000    11        11
 62:  7 2001    10        21
 63:  7 2002    NA        21
 64:  7 2003    NA        21
 65:  7 2004    NA        21
 66:  7 2005    NA        21
 67:  7 2006    10        31
 68:  7 2007    NA        31
 69:  7 2008    10        30
 70:  7 2009    NA        30
 71:  8 2000    NA        NA
 72:  8 2001    NA        NA
 73:  8 2002     9         9
 74:  8 2003    11        20
 75:  8 2004    11        31
 76:  8 2005    NA        31
 77:  8 2006    NA        31
 78:  8 2007    NA        31
 79:  8 2008    NA        31
 80:  8 2009    NA        31
 81:  9 2000    NA        NA
 82:  9 2001    NA        NA
 83:  9 2002    NA        NA
 84:  9 2003    11        11
 85:  9 2004     9        20
 86:  9 2005    NA        20
 87:  9 2006    NA        20
 88:  9 2007    NA        20
 89:  9 2008     9        29
 90:  9 2009    NA        29
 91: 10 2000     9         9
 92: 10 2001    NA         9
 93: 10 2002    NA         9
 94: 10 2003    NA         9
 95: 10 2004    NA         9
 96: 10 2005    NA         9
 97: 10 2006    NA         9
 98: 10 2007    NA         9
 99: 10 2008    NA         9
100: 10 2009    NA         9
     id year grade sum_lag_3

To address OP's edit: You can loop through each row of each student to get your past vector and future vector:

#for example using sum on past grades and mean on future grades
pastFunc <- sum
futureFunc <- mean

students[, {
  vapply(1L:.N, function(n) {
    past <- grade[seq_len(n-1)]
    future <- grade[seq_len(.N-n)+n]
    sum(past, na.rm=TRUE) + mean(future, na.rm=TRUE)
  }, numeric(1L))  
}, id]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • Thank you @chinsoon12, this is not just a post about a rolling sum, it is about how to preform contextual within-group operations. Also, while your code works for student 1 it does not work for the other students in the case of the rolling sum. But I appreciate your interest. – cach dies Jun 18 '20 at 01:55
  • hi @cachdies, can you share which student is wrong and also your actual formula? – chinsoon12 Jun 18 '20 at 02:18
  • sure thing, look and row 11, the first row of student 2 has 31 instead of 11. Henrik's answer replicates the desired output data for the sum example. – cach dies Jun 18 '20 at 02:58
  • @cachdies i do not get 31 as u mentioned i get 11 too – chinsoon12 Jun 18 '20 at 03:29
  • oh I was looking at the wrong row, I'm terribly sorry. Thank you very much you are right. – cach dies Jun 18 '20 at 03:32
4

Similar to @chinsoon12, but using zoo::rollapply to easily apply sum to a partial window.

d[!is.na(grade), rs := rollapply(grade, 3, sum, align = "right", partial = TRUE), by = id]
d[ , rs := nafill(rs, type = "locf"), by = id]

#     id year grade sum_lag_3 rs
#  1:  1 2000     9         9  9
#  2:  1 2001    NA         9  9
#  3:  1 2002    NA         9  9
#  4:  1 2003     9        18 18
#  5:  1 2004    10        28 28
#  6:  1 2005    NA        28 28
#  7:  1 2006    NA        28 28
#  8:  1 2007    11        30 30
#  9:  1 2008    NA        30 30
# 10:  1 2009    10        31 31
# 11:  2 2001    11        11 11

In data.table::frollsum, "partial window feature is not supported, although it can be accomplished by using adaptive=TRUE", and an adaptive rolling function (see ?frollsum):

arf = function(n, len) if(len < n) seq.int(len) else c(seq.int(n), rep(n, len - n))
# if no 'grade' is shorter than n (the full window width), you only need: 
# c(seq.int(n), rep(n, len - n))

d[!is.na(grade) , rs2 := frollsum(grade, n = arf(3, .N), align = "right", adaptive = TRUE),
 by = id]
d[ , rs2 := nafill(rs, type = "locf"), by = id]

#     id year grade sum_lag_3 rs rs2
#  1:  1 2000     9         9  9   9
#  2:  1 2001    NA         9  9   9
#  3:  1 2002    NA         9  9   9
#  4:  1 2003     9        18 18  18
#  5:  1 2004    10        28 28  28
#  6:  1 2005    NA        28 28  28
#  7:  1 2006    NA        28 28  28
#  8:  1 2007    11        30 30  30
#  9:  1 2008    NA        30 30  30
# 10:  1 2009    10        31 31  31
# 11:  2 2001    11        11 11  11

A note on your comment:

I want to be able to preform operations utilizing the past and future of a student for all kinds of operations not just a sum

In zoo::rollapply you can put other functions in the FUN argument. Currently the data.table equivalent, frollapply, does not have the adaptive argument. Thus, the method I used for frollsum above can not yet be applied in frollapply.

Henrik
  • 65,555
  • 14
  • 143
  • 159
  • Thanks Henrik, this solution very nice I don't have a problem using `zoo::rollapply`. And buy using `FUN = c` I get something very close to what I'm after however, the `width` argument would have to vary by year and the function would need to be called twice (one for past and once for the future). However I cant mark it a correct answer as I've failed to implement this function correctly to obtain these results. I don't know why. – cach dies Jun 18 '20 at 02:14