2

I am trying to determine a robust way to determine for how many historical rows a condition has been true, for each ID in a table.

Here is the example data:

DT <- data.table(update_date = rep(c("2022-01-01", "2022-01-02",
                                     "2022-01-03", "2022-01-04",
                                     "2022-01-05", "2022-01-06"), times = 2),
                 ID = c(rep("aapl", times = 6), rep("ibm", times = 6)),
        b = c("U1", "U1", "U1", "U2", "U2", "U2", "D1", "D2", "D1", "D3", "D2", "D3") )

DT[, update_date := as.Date(update_date)]

    update_date   ID  b
 1:  2022-01-01 aapl U1
 2:  2022-01-02 aapl U1
 3:  2022-01-03 aapl U1
 4:  2022-01-04 aapl U2
 5:  2022-01-05 aapl U2
 6:  2022-01-06 aapl U2
 7:  2022-01-01  ibm D1
 8:  2022-01-02  ibm D2
 9:  2022-01-03  ibm D1
10:  2022-01-04  ibm D3
11:  2022-01-05  ibm D2
12:  2022-01-06  ibm D3

What I need to calculate is for each row, how long the value in b has existed, done by ID.

So for ID == 'aapl', the value for row 6 would be 3, as the value "U2" has existed for 3 days (or rows). The value for row 5 would be 2. The value for row 3 would be 3 again.

For the ID == 'ibm', row 12 would have 1. Row 11 would have 1 as well as"D2" has only been true for 1 day (or row).

I can loop through each ID, and day, and look backward. I'm just wondering if there is a more concise way to do this than row by row.

M--
  • 25,431
  • 8
  • 61
  • 93
AdamC
  • 415
  • 1
  • 4
  • 4
  • You can avoid looping sequentially through each row by creating a sequence within each group `1:N` where N is the number of rows in the group and each group is a unique combination of `ID` and `b`. You can do this efficiently and quickly within a single call: see my proposed answer. – socialscientist Jul 21 '22 at 20:48

3 Answers3

1

Counting consecutive occurrences (i.e. run length) of b for each ID through specified update_date

DT[order(ID, update_date), occurence := 1:.N, by = list(ID, rleid(b))]
DT
#>     update_date   ID  b occurence
#>  1:  2022-01-01 aapl U1         1
#>  2:  2022-01-02 aapl U1         2
#>  3:  2022-01-03 aapl U1         3
#>  4:  2022-01-04 aapl U2         1
#>  5:  2022-01-05 aapl U2         2
#>  6:  2022-01-06 aapl U2         3
#>  7:  2022-01-01  ibm D1         1
#>  8:  2022-01-02  ibm D2         1
#>  9:  2022-01-03  ibm D1         1
#> 10:  2022-01-04  ibm D3         1
#> 11:  2022-01-05  ibm D2         1
#> 12:  2022-01-06  ibm D3         1

Counting occurrences of b for each ID through specified update_date

This includes occurrences that are non-consecutive.

#  Count of occurrences through present row
DT[order(ID, b, update_date), occurence := 1:.N, by = list(ID, b)]
DT
#>     update_date   ID  b occurence
#>  1:  2022-01-01 aapl U1         1
#>  2:  2022-01-02 aapl U1         2
#>  3:  2022-01-03 aapl U1         3
#>  4:  2022-01-04 aapl U2         1
#>  5:  2022-01-05 aapl U2         2
#>  6:  2022-01-06 aapl U2         3
#>  7:  2022-01-01  ibm D1         1
#>  8:  2022-01-02  ibm D2         1
#>  9:  2022-01-03  ibm D1         2
#> 10:  2022-01-04  ibm D3         1
#> 11:  2022-01-05  ibm D2         2
#> 12:  2022-01-06  ibm D3         2
socialscientist
  • 3,759
  • 5
  • 23
  • 58
1
DT[, counter := rowid(rleid(b)), .(ID)]
    update_date   ID  b counter
 1:  2022-01-01 aapl U1       1
 2:  2022-01-02 aapl U1       2
 3:  2022-01-03 aapl U1       3
 4:  2022-01-04 aapl U2       1
 5:  2022-01-05 aapl U2       2
 6:  2022-01-06 aapl U2       3
 7:  2022-01-01  ibm D1       1
 8:  2022-01-02  ibm D2       1
 9:  2022-01-03  ibm D1       1
10:  2022-01-04  ibm D3       1
11:  2022-01-05  ibm D2       1
12:  2022-01-06  ibm D3       1
M--
  • 25,431
  • 8
  • 61
  • 93
0

You could solve your problem as follow (rleid allows to take into account repeated values of b that are not consecutive within each group, like row 12 or 11 where the value of the variable b is not preceded by the same value):

DT[order(ID, update_date), ndays := seq_len(.N), by=.(ID, rleid(b))]

    update_date     ID      b ndays
 1:  2022-01-01   aapl     U1     1
 2:  2022-01-02   aapl     U1     2
 3:  2022-01-03   aapl     U1     3
 4:  2022-01-04   aapl     U2     1
 5:  2022-01-05   aapl     U2     2
 6:  2022-01-06   aapl     U2     3
 7:  2022-01-01    ibm     D1     1
 8:  2022-01-02    ibm     D2     1
 9:  2022-01-03    ibm     D1     1
10:  2022-01-04    ibm     D3     1
11:  2022-01-05    ibm     D2     1
12:  2022-01-06    ibm     D3     1