I recently needed to distribute the values of 12 time-invariant variables forwards and backwards across time by id. My dataset contained 2,448,638 observations and 57 variables.
Here is a reproducible example for discussion:
# Load packages
library(tidyverse)
library(zoo)
library(lubridate)
library(tidyr)
# Reproducable example
set.seed(2017)
df <- tibble(
id = integer(15),
days = integer(15),
race = character(15),
language = character(15)
) %>%
mutate(
id = rep(1:3, each = 5)
) %>%
group_by(id) %>%
mutate(
days = as.integer(c(rnorm(2, -30, 15), 0, rnorm(2, 200, 100))),
race = if_else(days == 0, sample(c("W", "AA", "A", "O"), 1, replace = TRUE), NA_character_),
language = if_else(days == 0, sample(c("English", "Spanish", "Other"), 1, replace = TRUE), NA_character_)
) %>%
arrange(id, days)
df
id days race language
<int> <int> <chr> <chr>
1 1 -31 <NA> <NA>
2 1 -8 <NA> <NA>
3 1 0 W English
4 1 24 <NA> <NA>
5 1 273 <NA> <NA>
6 2 -31 <NA> <NA>
7 2 -23 <NA> <NA>
8 2 0 O English
9 2 4 <NA> <NA>
10 2 199 <NA> <NA>
11 3 -33 <NA> <NA>
12 3 -6 <NA> <NA>
13 3 0 A English
14 3 234 <NA> <NA>
15 3 357 <NA> <NA>
I figured out a couple ways to get the result I want:
Using zoo::na.locf
time_invariant <- c("race", "language")
df2 <- df %>%
group_by(id) %>%
mutate_at(.vars = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
arrange(id, desc(days)) %>%
mutate_at(.vars = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
arrange(id, days)
Which takes 0.066293 secs to complete on the reproducible example using my 2016 MB Pro.
I also tried tidyr::fill
df2 <- df %>%
group_by(id) %>%
fill_(fill_cols = time_invariant) %>%
fill_(fill_cols = time_invariant, .direction = "up")
Which takes 0.04381585 secs to complete on the reproducible example using my 2016 MB Pro.
However, on my real data the zoo::na.locf approach took 3.172092 mins, and the tidyr::fill approach took 5.523152 mins. These times aren't terrible, but I did notice that they were considerably slower than Stata (9.9060 secs on my 2016 MB Pro running Stata 14.2). This speed difference prompted me to see if anyone knows of a faster approach.