13

I have a large, wide data.table (20m rows) keyed by a person ID but with lots of columns (~150) that have lots of null values. Each column is a recorded state / attribute that I wish to carry forward for each person. Each person may have anywhere from 10 to 10,000 observations and there are about 500,000 people in the set. Values from one person can not 'bleed' into the following person, so my solution must respect the person ID column and group appropriately.

For demonstration purposes - here's a very small sample input:

DT = data.table(
  id=c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa=c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb=c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  cc=c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)

It looks like this:

    id aa bb cc
 1:  1  A NA  1
 2:  1 NA NA NA
 3:  1  B NA NA
 4:  1  C NA NA
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA NA
 8:  2  E NA  5
 9:  3  F NA  6
10:  3 NA NA NA
11:  3 NA NA  7
12:  3 NA NA NA

My expected output looks like this:

    id aa bb cc
 1:  1  A NA  1
 2:  1  A NA  1
 3:  1  B NA  1
 4:  1  C NA  1
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA  4
 8:  2  E NA  5
 9:  3  F NA  6
10:  3  F NA  6
11:  3  F NA  7
12:  3  F NA  7

I've found a data.table solution that works, but it's terribly slow on my large data sets:

DT[, na.locf(.SD, na.rm=FALSE), by=id]

I've found equivalent solutions using dplyr that are equally slow.

GRP = DT %>% group_by(id)
data.table(GRP %>% mutate_each(funs(blah=na.locf(., na.rm=FALSE))))

I was hopeful that I could come up with a rolling 'self' join using the data.table functionality, but I just can't seem to get it right (I suspect I would need to use .N but I just haven't figured it out).

At this point I'm thinking I'll have to write something in Rcpp to efficiently apply the grouped locf.

I'm new to R, but I'm not new to C++ - so I'm confident I can do it. I just feel like there should be an efficient way to do this in R using data.table.

carl.anderson
  • 1,040
  • 11
  • 16
  • I'm pretty sure `DT[, lapply(.SD, na.locf, F), by = id]` will be faster – eddi May 05 '16 at 21:09
  • I actually started with that and found the performance to be worse. – carl.anderson May 05 '16 at 21:19
  • Rolling self join looks to be on point here, I remember some questions that have both `na.locf` and rolling joins answers, so I think you may find the answer in current SO knowledge base. – jangorecki May 05 '16 at 21:33
  • 1
    With an ordered "id", perhaps you could use something like: `tmp = c(TRUE, DT$id[-1] != DT$id[-nrow(DT)]); DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | tmp) * seq_len(nrow(DT)))])]`? – alexis_laz May 05 '16 at 22:49
  • @alexis_laz - wow! That's amazing! It works and it's 2 full orders of magnitude faster than the data.table solution. Can you help me understand what the code is doing? Also, your comment should be made into an answer so I can mark this solved. – carl.anderson May 06 '16 at 03:13

1 Answers1

27

A very simple na.locf can be built by forwarding (cummax) the non-NA indices ((!is.na(x)) * seq_along(x)) and subsetting accordingly:

x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
x[cummax((!is.na(x)) * seq_along(x))]
# [1] 1 1 1 6 4 5 4 4 4 2

This replicates na.locf with an na.rm = TRUE argument, to get na.rm = FALSE behavior we simply need to make sure the first element in the cummax is TRUE:

x = c(NA, NA, 1, NA, 2)
x[cummax(c(TRUE, tail((!is.na(x)) * seq_along(x), -1)))]
#[1] NA NA  1  1  2

In this case, we need to take into account not only the non-NA indices but, also, of the indices where the (ordered, or to be ordered) "id" column changes value:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
c(TRUE, id[-1] != id[-length(id)])
# [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

Combining the above:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x =  c(1,  NA, NA, 6,  4,  5,  4,  NA, NA, 2)

x[cummax(((!is.na(x)) | c(TRUE, id[-1] != id[-length(id)])) * seq_along(x))]
# [1]  1  1 NA  6  4  5  4  4 NA  2

Note, that here we OR the first element with TRUE, i.e. make it equal to TRUE, thus getting the na.rm = FALSE behavior.

And for this example:

id_change = DT[, c(TRUE, id[-1] != id[-.N])]
DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
#    id aa bb cc
# 1:  1  A NA  1
# 2:  1  A NA  1
# 3:  1  B NA  1
# 4:  1  C NA  1
# 5:  2 NA NA NA
# 6:  2 NA NA  4
# 7:  2  D NA  4
# 8:  2  E NA  5
# 9:  3  F NA  6
#10:  3  F NA  6
#11:  3  F NA  7
#12:  3  F NA  7
eddi
  • 49,088
  • 6
  • 104
  • 155
alexis_laz
  • 12,884
  • 4
  • 27
  • 37
  • 7
    the downvote is very much not obvious to me, and some explanation would be appreciated – eddi May 06 '16 at 14:38
  • 2
    Great answer imo - not only is this a much faster version of a regular `na.locf`, but it also adds a modification to do it per group (assuming sorted groups), **without** actually doing a `by` loop (which would introduce extra `eval`'s per group and would slow it down). Unless I'm missing something - this should be the standard `na.locf` implementation, instead of the `rle` stuff that `zoo` does. – eddi May 06 '16 at 15:50
  • @eddi : thanks for the edits. I guess the `zoo::na.locf` is more flexible, though, I believe that for simple cases, the `4-5 * length(x)` scans of the `cummax` version shall be pretty straightforward. And, also, it indeed proved convenient to pass each column pointer once in the function and be applied virtually "by" group. – alexis_laz May 06 '16 at 16:39
  • 2
    I might add that with my original test set of 20m rows, the first suggested `lapply` solution took 40 hours to complete. The new code takes just 4 minutes! I doubt Rcpp could do much better than that. – carl.anderson May 06 '16 at 17:56
  • @carl.anderson I did a quick back of the envelope test, and you'll easily get 2-3x improvement with Rcpp – eddi May 06 '16 at 20:06
  • This answer needs more upvotes, imo. This answer has shaved off so much time from some of my processes. Thank you!! – sehock Jan 24 '20 at 19:43
  • This is an incredible solution! zoo::na.locf was the main bottleneck in my script! – cosmia1 Jun 22 '21 at 00:49
  • zoo::na.locf work incorrectly ```zoo::na.locf(c(NA,1))``` will produce "1" instead of c(NA,1). Also, ```zoo::na.locf(c(NA))``` fail to "logical(0)" answer. ```na.locf``` can't be used. – crow16384 Nov 17 '22 at 05:49