I have a set of variables that contain data about whether or not a person has ever had certain health conditions. For example, "have you ever had a heart attack?"

If they say "yes" at observation 2, then the answer is still yes at observations 3 and 4. But, it is not necessarily yes at observation 1. The heart attack could have occurred between observation 1 and 2.

If they say "no" at observation 2, then the answer is also no at observation 1. But it is not necessarily no at observations 3 or 4.

Here is a reproducible example:

library(tibble)

df <- tibble(
  id = rep(1:3, each = 4),
  obs = rep(1:4, times = 3),
  mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA)
)
df
   id obs mi_ever
1   1   1      NA
2   1   2       0
3   1   3       1
4   1   4      NA
5   2   1      NA
6   2   2       0
7   2   3      NA
8   2   4      NA
9   3   1      NA
10  3   2       1
11  3   3      NA
12  3   4      NA

It's trivial to carry my 0's (No's) backward or carry my 1's (Yes's) forward using zoo::na.locf. However, I'm not sure how to carry 0's backward and 1's forward. Ideally, I'd like the following result:

   id obs mi_ever mi_ever_2
1   1   1      NA         0
2   1   2       0         0
3   1   3       1         1
4   1   4      NA         1
5   2   1      NA         0
6   2   2       0         0
7   2   3      NA        NA
8   2   4      NA        NA
9   3   1      NA        NA
10  3   2       1         1
11  3   3      NA         1
12  3   4      NA         1

I've checked out the following posts, but none seem to cover exactly what I'm asking here.

Carry last Factor observation forward and backward in group of rows in R

Forward and backward fill data frame in R

making a "dropdown" function in R

Any help is appreciated.

Brad Cannell

2 Answers

Basically I'm marking the items in sequence after the first 1 to become 1 and the ones before the last 0 to become 0.

 ever <- function(x) min(which(x == 1))      # index of the first 1 in the group
 NA_1 <- function(x) seq_along(x) > ever(x)  # TRUE after the first 1; could have done in one function
 # check to see if working
 ave(df$mi_ever, df$id, FUN = function(x) { x[NA_1(x)] <- 1; x })
 [1] NA  0  1  1 NA  0 NA NA NA  1  1  1

 not_yet <- function(x) max(which(x == 0))      # index of the last 0 in the group
 NA_0 <- function(x) seq_along(x) < not_yet(x)  # TRUE before the last 0
 # make temporary version of 1-modified column
 temp1 <- ave(df$mi_ever, df$id, FUN = function(x) { x[NA_1(x)] <- 1; x })
 # then make final version; could have done it "in place" I suppose.
 df$ever2 <- ave(temp1, df$id, FUN = function(x) { x[NA_0(x)] <- 0; x })
 df
# A tibble: 12 x 4
      id   obs mi_ever ever2
   <int> <int>   <dbl> <dbl>
 1     1     1      NA     0
 2     1     2       0     0
 3     1     3       1     1
 4     1     4      NA     1
 5     2     1      NA     0
 6     2     2       0     0
 7     2     3      NA    NA
 8     2     4      NA    NA
 9     3     1      NA    NA
10     3     2       1     1
11     3     3      NA     1
12     3     4      NA     1

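The code emits warnings for groups that contain no 1 (or no 0), because min() and max() are then called on an empty vector. The results are still correct, and the warnings can be suppressed if needed.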
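One way to silence those warnings, sketched here under my own assumptions rather than taken from the answer, is to wrap the min() and max() calls in suppressWarnings(). The helper names first_one, last_zero and fill_ever are hypothetical. The sketch relies on min() of an empty vector returning Inf and max() returning -Inf, so the index comparisons select nothing for groups that lack a 1 or a 0.

```r
# Same data as the question, built with base R so the snippet stands alone
df <- data.frame(
  id = rep(1:3, each = 4),
  obs = rep(1:4, times = 3),
  mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA)
)

# Hypothetical helpers: min()/max() of an empty index vector return Inf/-Inf,
# and suppressWarnings() hides the warning that comes with that
first_one <- function(x) suppressWarnings(min(which(x == 1)))
last_zero <- function(x) suppressWarnings(max(which(x == 0)))

fill_ever <- function(x) {
  x[seq_along(x) > first_one(x)] <- 1  # carry 1's forward
  x[seq_along(x) < last_zero(x)] <- 0  # carry 0's backward
  x
}

result <- ave(df$mi_ever, df$id, FUN = fill_ever)
result
# 0 0 1 1 0 0 NA NA NA 1 1 1
```

Checking the length of which() before calling min() or max() would avoid the warning altogether and is arguably cleaner than suppressing it.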

IRTFM

I took the answer from @42- above (Thank you!), and tweaked it a little bit to further suit my needs. Specifically, I:

  • Took care of the warnings "no non-missing arguments to min; returning Inf" and "no non-missing arguments to max; returning -Inf".
  • Combined the separate functions into a single function (although the separate functions were extremely useful for learning).
  • Added an optional check_logic argument. When TRUE, the function will return 9's if a 0 comes after a 1. This represents a data error or logic flaw that warrants further investigation.
  • Added an example of using the function with data.table, and on multiple variables at once. This more accurately represents how I'm using the function in real life, and I thought it may be useful to others.

The function:

distribute_ever <- function(x, check_logic = TRUE, ...) {
  if (check_logic) {
    ones  <- which(x == 1)
    zeros <- which(x == 0)
    if (length(ones) > 0 && length(zeros) > 0) {
      if (min(ones) < max(zeros)) {
        x[] <- 9                            # Set every x in the group to 9 if a 0 comes after a 1
      }
    }
  }
  ones <- which(x == 1)                     # Get indices for 1's
  if (length(ones) > 0) {                   # Prevents warning
    first_1_by_group <- min(ones)           # Index first 1 by group
    x[seq_along(x) > first_1_by_group] <- 1 # Set x at subsequent indices to 1
  }
  zeros <- which(x == 0)                    # Get indices for 0's (after 1's are carried forward)
  if (length(zeros) > 0) {                  # Prevents warning
    last_0_by_group <- max(zeros)           # Index last 0 by group
    x[seq_along(x) < last_0_by_group] <- 0  # Set x at previous indices to 0
  }
  x
}

A new reproducible example with multiple "ever" variables and some cases with 0 after 1:

library(data.table)

dt <- data.table(
  id = rep(1:3, each = 4),
  obs = rep(1:4, times = 3),
  mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA),
  diab_ever = c(0, NA, NA, 1, 1, NA, NA, 0, 0, NA, NA, NA)
)

Iterate over multiple variables quickly using data.table (with by-group processing):

ever_vars <- c("mi_ever", "diab_ever")

dt[, paste0(ever_vars, "_2") := lapply(.SD, distribute_ever), 
   .SDcols = ever_vars, 
   by = id][]

Results:

    id obs mi_ever diab_ever mi_ever_2 diab_ever_2
 1:  1   1      NA         0         0           0
 2:  1   2       0        NA         0          NA
 3:  1   3       1        NA         1          NA
 4:  1   4      NA         1         1           1
 5:  2   1      NA         1         0           9
 6:  2   2       0        NA         0           9
 7:  2   3      NA        NA        NA           9
 8:  2   4      NA         0        NA           9
 9:  3   1      NA         0        NA           0
10:  3   2       1        NA         1          NA
11:  3   3      NA        NA         1          NA
12:  3   4      NA        NA         1          NA

For each input "ever" variable:

  • A new variable is created with "_2" appended to the end of the input variable name. You could also edit "in place" as 42- pointed out, but I like being able to double check my data.
  • Zeros are carried backward and ones are carried forward in time.
  • NA's after zeros and before ones (within id) are returned unchanged.
  • If there is a 0 (No, I've never had ...) after a 1 (Yes, I've had ...), as is the case with person 2's responses regarding diabetes, then the function returns 9's.
  • If we were to set check_logic to FALSE, then 1's would win out and replace any later 0's.
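To make the last two points concrete, here is a small sketch on a single vector that has a 0 after a 1. The function distribute_ever_sketch is a hypothetical, condensed restatement of distribute_ever, included only so the snippet runs on its own. It assumes the whole group should be flagged with 9's.

```r
# Condensed, hypothetical restatement of distribute_ever() for illustration
distribute_ever_sketch <- function(x, check_logic = TRUE) {
  ones  <- which(x == 1)
  zeros <- which(x == 0)
  if (check_logic && length(ones) > 0 && length(zeros) > 0 &&
      min(ones) < max(zeros)) {
    x[] <- 9                                   # flag the whole group
  }
  ones <- which(x == 1)                        # recompute after possible flagging
  if (length(ones) > 0) x[seq_along(x) > min(ones)] <- 1
  zeros <- which(x == 0)                       # recompute after 1's are carried forward
  if (length(zeros) > 0) x[seq_along(x) < max(zeros)] <- 0
  x
}

conflict <- c(1, NA, NA, 0)                    # a "no" reported after a "yes"

flagged <- distribute_ever_sketch(conflict)
flagged
# 9 9 9 9

forced <- distribute_ever_sketch(conflict, check_logic = FALSE)
forced
# 1 1 1 1
```

With check_logic = FALSE the first 1 is carried forward over the later 0, so no zero is left to carry backward.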
Brad Cannell