3

I recently needed to distribute the values of 12 time-invariant variables forwards and backwards across time by id. My dataset contained 2,448,638 observations and 57 variables.

Here is a reproducible example for discussion:

# Load packages
library(tidyverse)
library(zoo)
library(lubridate)
library(tidyr)

# Reproducable example
set.seed(2017)
df <- tibble(
  id       = integer(15),
  days     = integer(15),
  race     = character(15),
  language = character(15)
  ) %>% 

  mutate(
    id = rep(1:3, each = 5)
  ) %>% 

  group_by(id) %>% 

  mutate(
    days     = as.integer(c(rnorm(2, -30, 15), 0, rnorm(2, 200, 100))),
    race     = if_else(days == 0, sample(c("W", "AA", "A", "O"), 1, replace = TRUE), NA_character_),
    language = if_else(days == 0, sample(c("English", "Spanish", "Other"), 1, replace = TRUE), NA_character_)
  ) %>% 

  arrange(id, days)

df

      id  days  race language
   <int> <int> <chr>    <chr>
1      1   -31  <NA>     <NA>
2      1    -8  <NA>     <NA>
3      1     0     W  English
4      1    24  <NA>     <NA>
5      1   273  <NA>     <NA>
6      2   -31  <NA>     <NA>
7      2   -23  <NA>     <NA>
8      2     0     O  English
9      2     4  <NA>     <NA>
10     2   199  <NA>     <NA>
11     3   -33  <NA>     <NA>
12     3    -6  <NA>     <NA>
13     3     0     A  English
14     3   234  <NA>     <NA>
15     3   357  <NA>     <NA>

I figured out a couple ways to get the result I want:

Using zoo::na.locf

time_invariant <- c("race", "language")

df2 <- df %>% 
  group_by(id) %>% 
  mutate_at(.vars = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
  arrange(id, desc(days)) %>%
  mutate_at(.vars = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
  arrange(id, days)

Which takes 0.066293 secs to complete on the reproducible example using my 2016 MB Pro.

I also tried tidyr::fill

df2 <- df %>% 
  group_by(id) %>% 
  fill_(fill_cols = time_invariant) %>% 
  fill_(fill_cols = time_invariant, .direction = "up")

Which takes 0.04381585 secs to complete on the reproducible example using my 2016 MB Pro.

However, on my real data the zoo::na.locf approach took 3.172092 mins, and the tidyr::fill approach took 5.523152 mins. These times aren't terrible, but I did notice that they were considerably slower than Stata (9.9060 secs on my 2016 MB Pro running Stata 14.2). This speed difference prompted me to see if anyone knows of a faster approach.

Brad Cannell
  • 3,020
  • 2
  • 23
  • 39
  • 1
    perhaps a `data.table` solution, along the lines of `library(data.table)` ; `setDT(df)`; `df[, race := race[1], by = cumsum(!is.na(race))]` – SymbolixAU Apr 11 '17 at 02:53
  • I haven't tested that line of code against your versions as I haven't yet managed to bring myself to install "`tidyverse` " – SymbolixAU Apr 11 '17 at 03:03
  • @chinsoon12 - If you get it to work and it benchmarks quicker than the proposed solutions, feel free to add it as an answer; I was merely setting the ball rolling :) – SymbolixAU Apr 11 '17 at 03:16
  • @chinsoon's data.table solution was much faster on my data (~17.26421 secs). I started thinking about what I would do if I only wanted to fill forward or backward (as opposed to across all observations within id). The current best answer seems to be using data.table with [zoo::na.locf](http://stackoverflow.com/questions/12607465/filling-in-missing-blanks-in-a-data-table-per-category-backwards-and-forwar). [Using Rcpp](http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value) may also be an option, but I don't know enough C++ to make any meaningful comments about it. – Brad Cannell Apr 11 '17 at 19:35
  • you might want to post it as another qn and clarify how you would want to handle the 2nd non-NA. e.g. c(NA, NA, 1, NA, NA, 2) into c(1, 1, 1, 1, 1, 2) or into c(1, 1, 1, 2, 2, 2). the soln will probably involve some better `by` groupings (see 1st comment by @SymbolixAU) and/or possibly with `roll` – chinsoon12 Apr 12 '17 at 00:56
  • 1
    You might want to take a look at the na_locf() version of the imputeTS package - which is faster than the zoo version. Might save you some extra seconds. – Steffen Moritz Oct 07 '19 at 16:16

1 Answers1

2

quite sure this can be made faster by experts:

df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
    3L, 3L, 3L, 3L, 3L), days = c(-31L, -8L, 0L, 24L, 273L, -31L, 
        -23L, 0L, 4L, 199L, -33L, -6L, 0L, 234L, 357L), race = c(NA, 
            NA, "W", NA, NA, NA, NA, "O", NA, NA, NA, NA, "A", NA, NA), language = c(NA, 
                NA, "English", NA, NA, NA, NA, "English", NA, NA, NA, NA, "English", 
                NA, NA)), class = "data.frame", row.names = c(NA, -15L), .Names = c("id", 
                    "days", "race", "language"))

library(dplyr)
library(zoo)
library(tidyr)
time_invariant <- c("race", "language")
dplyrzoo <- function() {
    df2 <- df %>% 
        group_by(id) %>% 
        mutate_at(.cols = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
        arrange(id, desc(days)) %>%
        mutate_at(.cols = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
        arrange(id, days)
}

dplyrfill <- function() {
    df2 <- df %>% 
        group_by(id) %>% 
        fill_(fill_cols = time_invariant) %>% 
        fill_(fill_cols = time_invariant, .direction = "up")
}

library(data.table)
dtstyle <- function() {
    dt <- data.table(df)
    cols <- c("race", "language")
    dt[, (cols) := lapply(.SD, function(x) na.omit(x)[1]), .SDcols=cols, by =.(id)]
    dt
}

#check results
all.equal(as.data.frame(dplyrzoo()), as.data.frame(dplyrfill()))
all.equal(as.data.frame(dtstyle()), as.data.frame(dplyrfill()))

#timings
library(microbenchmark)
timings <- capture.output(microbenchmark(dplyrzoo=dplyrzoo(),
    dplyrfill=dplyrfill(),
    dtstyle=dtstyle(),
    times=100L))
writeLines(paste("#", timings))

# Unit: milliseconds
#       expr    min      lq     mean  median      uq     max neval
#   dplyrzoo 6.7952 7.01815 7.399851 7.18815 7.53685 10.8360   100
#  dplyrfill 4.7458 5.02865 5.319848 5.16990 5.34750  7.8329   100
#    dtstyle 1.3598 1.54025 1.692119 1.65420 1.73280  4.0413   100
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • I don't need a `by` for my case. `dplyrfill` works by removing the `group_by` row. However removing `,by = .(id)` stops the `dtstyle` method from working – Olivia Feb 07 '18 at 14:38
  • I need it because `df = na.locf(df)` gives me a memory error. – Olivia Feb 07 '18 at 14:39
  • I'm pulling data from an api where theres a left and right engine with different sensors. They record at different frequencies such that row 1,3... has data for all the left parameters and 2,4.... for the right. I want to fill the NA with the last result. Where the data frames are 10million rows plus I get a `error cannot allocate 1.4gig` when using `df = na.locf(df)` – Olivia Feb 08 '18 at 09:23
  • 1
    @Olivia why don’t you post a new question? I am quite sure someone will be able to help. – chinsoon12 Feb 08 '18 at 12:43
  • `df2 <- df %>% fill_(fill_cols = time_invariant) %>% fill_(fill_cols = time_invariant, .direction = "up")` works, just wondered if the data table one could be done without grouping too as thats faster – Olivia Feb 08 '18 at 13:23
  • 1
    @Olivia you can try `setDT(df)[, time_invariant := na.locf(na.locf(time_invariant), fromLast=TRUE)]` – chinsoon12 Feb 10 '18 at 22:24
  • I was using `cols` for `time_invariant` which was a list of column names. The above doesnt work for multiple columns – Olivia Feb 23 '18 at 11:16
  • 1
    Use setDT(df)[,lapply(.SD, blah blah blah), .SDcols=c(my list of columns)] – chinsoon12 Feb 23 '18 at 13:09