Carry forward a specific value if observations stop before a certain time

Question

For a longitudinal dataset, I want to carry forward observations that terminated before day 7 with y=3, completing records with consecutive days up until day 7 with y=3. A related question is at How to make continuous time sequences within groups in data.table? . The following solution works but I would like to also have a solution that (1) subsetted the observations earlier (see below) or that (2) did the carry forward with a join in one step.

d <- data.table(t =c(1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 5, 6, 7, 1, 2, 3, 5, 6),
                id=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
                y =c(1, 2, 1, 2, 3, 1, 1, 1, 2, 2, 3, 3, 3, 1, 2, 2, 2, 3),
                x =c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
                key=c('id', 't'))
d

    t id y x
 1: 1  1 1 0
 2: 2  1 2 0
 3: 1  2 1 1
 4: 2  2 2 1
 5: 3  2 3 1
 6: 1  3 1 0
 7: 2  3 1 0
 8: 1  4 1 1
 9: 2  4 2 1
10: 3  4 2 1
11: 5  4 3 1
12: 6  4 3 1
13: 7  4 3 1
14: 1  5 1 0
15: 2  5 2 0
16: 3  5 2 0
17: 5  5 2 0
18: 6  5 3 0

w <- d[, .(tlast=t, last3 = t == max(t) & y == 3 & t < 7, x=x), by=id]
w <- w[last3 == TRUE, .(t = (tlast + 1) : 7, y=rep(3, 7 - tlast), x=x), by=id]
d <- rbind(d, w)
setkey(d, id, t)
d

   t id y x
 1: 1  1 1 0
 2: 2  1 2 0
 3: 1  2 1 1
 4: 2  2 2 1
 5: 3  2 3 1
 6: 4  2 3 1
 7: 5  2 3 1
 8: 6  2 3 1
 9: 7  2 3 1
10: 1  3 1 0
11: 2  3 1 0
12: 1  4 1 1
13: 2  4 2 1
14: 3  4 2 1
15: 5  4 3 1
16: 6  4 3 1
17: 7  4 3 1
18: 1  5 1 0
19: 2  5 2 0
20: 3  5 2 0
21: 5  5 2 0
22: 6  5 3 0
23: 7  5 3 0
    t id y x

The following doesn't work (results in data.table with 0 rows and 4 cols)

w <- d[(t == max(t) & y == 3 & t < 7) == TRUE, .SD, by=id]

Hi Prof, in `data.table`, the order of operation is `i` then for each `by`, compute `j` and within each `by`, `.SD` contains the current subset of data (see Take DT, subset/reorder rows using `i`, then calculate `j`, grouped by `by` in https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html). In the example data, there are no rows where `t == max(t) & y == 3 & t < 7` since `max(t)` is 7. — chinsoon12, Apr 09 '21 at 05:12
Thanks - it was the order of operations that I was forgetting. — Frank Harrell, Apr 09 '21 at 11:14

r2evans · Accepted Answer · 2021-04-09T01:00:01.860

data.table

cols <- c("x", "y")
merge(d[, .(t = if (3 %in% y && max(t) < 7) as.numeric(c(t, (1+max(t)):7)) else t),
      by = .(id)], d, by = c("id", "t"), all.x = TRUE
  )[, (cols) := lapply(.SD, nafill, type = "locf"), by = .(id), .SDcols = cols][]
#        id     t     y     x
#     <num> <num> <num> <num>
#  1:     1     1     1     0
#  2:     1     2     2     0
#  3:     2     1     1     1
#  4:     2     2     2     1
#  5:     2     3     3     1
#  6:     2     4     3     1
#  7:     2     5     3     1
#  8:     2     6     3     1
#  9:     2     7     3     1
# 10:     3     1     1     0
# 11:     3     2     1     0
# 12:     4     1     1     1
# 13:     4     2     2     1
# 14:     4     3     2     1
# 15:     4     5     3     1
# 16:     4     6     3     1
# 17:     4     7     3     1
# 18:     5     1     1     0
# 19:     5     2     2     0
# 20:     5     3     2     0
# 21:     5     5     2     0
# 22:     5     6     3     0
# 23:     5     7     3     0
#        id     t     y     x

Walk-through:

we first need to generate a list that contains the t values we need per-id, so

d[, .(t = if (3 %in% y && max(t) < 7) as.numeric(c(t, (1+max(t)):7)) else t), by = .(id)]
#        id     t
#     <num> <num>
#  1:     1     1
#  2:     1     2
#  3:     2     1
#  4:     2     2
#  5:     2     3
#  6:     2     4
#  7:     2     5
#  8:     2     6
#  9:     2     7
# 10:     3     1
# 11:     3     2
# 12:     4     1
# 13:     4     2
# 14:     4     3
# 15:     4     5
# 16:     4     6
# 17:     4     7
# 18:     5     1
# 19:     5     2
# 20:     5     3
# 21:     5     5
# 22:     5     6
# 23:     5     7
#        id     t

This does not fill in missing steps (4 is missing in ids 3 and 4). If y contains 3, then we fill out t up to 7, otherwise we do nothing.

Note: t here is numeric, which requires a little dancing with (integer) sequences, ergo the as.numeric to silence data.table's complaints about matching column types.

a simple merge against the original d will leave some NA holes in the data, which is intentional:

merge(d[, .(t = if (3 %in% y && max(t) < 7) as.numeric(c(t, (1+max(t)):7)) else t), by = .(id)], d, by = c("id", "t"), all.x = TRUE)
#        id     t     y     x
#     <num> <num> <num> <num>
#  1:     1     1     1     0
#  2:     1     2     2     0
#  3:     2     1     1     1
#  4:     2     2     2     1
#  5:     2     3     3     1
#  6:     2     4    NA    NA
#  7:     2     5    NA    NA
#  8:     2     6    NA    NA
#  9:     2     7    NA    NA
# 10:     3     1     1     0
# 11:     3     2     1     0
# 12:     4     1     1     1
# 13:     4     2     2     1
# 14:     4     3     2     1
# 15:     4     5     3     1
# 16:     4     6     3     1
# 17:     4     7     3     1
# 18:     5     1     1     0
# 19:     5     2     2     0
# 20:     5     3     2     0
# 21:     5     5     2     0
# 22:     5     6     3     0
# 23:     5     7    NA    NA
#        id     t     y     x

from here, it's as simple as nafill(., type="locf"), using .SDcols for efficiency (and generality, so that we don't care what other columns there are, as long as cols lists them).

The reason that

d[(t == max(t) & y == 3 & t < 7) == TRUE, .SD, by=id]

returns 0 rows is that the i condition is first and not in .SD. Because of this, it is a global condition, not a per-group condition. The by= is not considered for the condition, so expression is equivalent to

d[(t == max(t) & y == 3 & t < 7),] # no .SD, no by=

which is also 0 rows. But looking at it that way, realize that there is only one row where t == max(t), row 13, where t is 7. On that row, y is 3 (so far so good), but t<7 is not true.

Changing it to the per-group thing within .SD returns data:

d[, .SD[(t == max(t) & y == 3 & t < 7),], by=id]
#       id     t     y     x
#    <num> <num> <num> <num>
# 1:     2     3     3     1
# 2:     5     6     3     0

Very nice! If you would take a step back and assess your solution vs. my original `rbind` solution, in terms of understandability/elegance, I'd appreciate your perception. Also can anyone shed light on why `w <- d[(t == max(t) & y == 3 & t < 7) == TRUE, .SD, by=id] ` doesn't work, and how to make it work? — Frank Harrell, Apr 08 '21 at 22:07
I think conditions like that need to go on `.SD`, not in the global sense. Perhaps `d[,.SD[(t == max(t) & y == 3 & t < 7),], by=id]` is closer? — r2evans, Apr 09 '21 at 00:49
For me, part of the learning-curve of `data.table` is discovering that (typically) when there is a `by=`, anything in `i` should really be in `origdata[,.SD[i,],by=]`, not in `origdata[i,.SD,by=]`. I'm confident there are situations where this rule-of-thumb might not be best, but it's a start. — r2evans, Apr 09 '21 at 00:52

score 0 · Answer 2 · answered Apr 08 '21 at 15:50

I'm not sure this is more efficient, but here's a dplyr approach (my data.table knowledge is too limited for this one). Hopefully, you can adapt a data.table version of this if you find it useful.

library(dplyr)

crossing(id=unique(d$id), t=1:max(d$t)) %>% 
  full_join(d) %>% 
  group_by(id) %>% 
  filter(!(max(y, na.rm=TRUE) < 3 & is.na(y)) &
           !(is.na(y) & !is.na(lead(y)))) %>% 
  mutate(across(c(y,x), zoo::na.locf))

      id     t     y     x
 1     1     1     1     0
 2     1     2     2     0
 3     2     1     1     1
 4     2     2     2     1
 5     2     3     3     1
 6     2     4     3     1
 7     2     5     3     1
 8     2     6     3     1
 9     2     7     3     1
10     3     1     1     0
11     3     2     1     0
12     4     1     1     1
13     4     2     2     1
14     4     3     2     1
15     4     5     3     1
16     4     6     3     1
17     4     7     3     1
18     5     1     1     0
19     5     2     2     0
20     5     3     2     0
21     5     5     2     0
22     5     6     3     0
23     5     7     3     0

Thanks. I'm seeking a `data.table` approach. – Frank Harrell Apr 08 '21 at 16:05 — Frank Harrell, Apr 08 '21 at 16:05

score 0 · Answer 3 · answered Apr 08 '21 at 15:59

If you don't mind using dplyr this might do what you're looking for. Split the data set into two groups then address the cases that need the 3rd row repeated. dplyr::summarize is normally used to collapse many rows into one, but it can also expand one into many.

library(dplyr)

# original data
d <- tibble(t =c(1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 5, 6, 7, 1, 2, 3, 5, 6),
            id=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
            y =c(1, 2, 1, 2, 3, 1, 1, 1, 2, 2, 3, 3, 3, 1, 2, 2, 2, 3),
            x =c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0)) %>%
        arrange(id, t)


bind_rows(
    
    
    d %>%
        # subset for cases to leave alone
        group_by(id) %>%
        filter( max(t) < 3 | 7 <= max(t))  %>%
        ungroup(),
    
    d %>%
        # subset for cases to fill in by carrying values forward
        group_by(id) %>%
        filter(3 <= max(t) & max(t) < 7) %>%

        # get the 3rd row
        filter(t == 3) %>%

        # repeat 3rd row until there are 7 rows
        summarize(t = seq(t, 7), x = x, y = y)  %>%
        ungroup()
    
) %>% arrange(id, t)

Output

# A tibble: 20 x 4
       t    id     y     x
   <dbl> <dbl> <dbl> <dbl>
 1     1     1     1     0
 2     2     1     2     0
 3     3     2     3     1
 4     4     2     3     1
 5     5     2     3     1
 6     6     2     3     1
 7     7     2     3     1
 8     1     3     1     0
 9     2     3     1     0
10     1     4     1     1
11     2     4     2     1
12     3     4     2     1
13     5     4     3     1
14     6     4     3     1
15     7     4     3     1
16     3     5     2     0
17     4     5     2     0
18     5     5     2     0
19     6     5     2     0
20     7     5     2     0

Thanks. I'm wanting a `data.table` approach. – Frank Harrell Apr 08 '21 at 16:04 — Frank Harrell, Apr 08 '21 at 16:04

Carry forward a specific value if observations stop before a certain time

3 Answers3

data.table