
If one wants to fill in missing values of a variable with the previous (or next) non-NA observation within a group, the data.table command is

    setkey(DT, id, date)
    DT[, value_filled_in := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]]

which is quite complex. That's a shame, since `roll` is a very fast and powerful option (especially compared with applying a function such as `zoo::na.locf` within each group).
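To make the mechanics concrete, here is a minimal, self-contained sketch of that rolling-join fill on a toy table (the column names `id`, `date`, and `value` match the question; the data are made up):

```r
library(data.table)

# Toy table: two groups, some missing values
DT <- data.table(id    = c(1, 1, 1, 2, 2),
                 date  = c(1, 2, 3, 1, 2),
                 value = c(10, NA, NA, NA, 5))
setkey(DT, id, date)

# Join the non-NA rows back onto all (id, date) pairs; roll = TRUE
# carries the last non-NA value forward within each id
DT[, value_filled_in := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]]

DT$value_filled_in
# 10 10 10 NA  5   (NA remains for id 2, date 1: no earlier observation to roll)
```

Note that the roll never crosses group boundaries, because `date` is the last key column and rolling applies within combinations of the preceding key columns.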

I can write a convenience function to fill in missing values

    fill_na <- function(x, by = NULL, roll = TRUE,
                        rollends = if (roll == "nearest") c(TRUE, TRUE)
                                   else if (roll >= 0) c(FALSE, TRUE)
                                   else c(TRUE, FALSE)) {
      id <- seq_along(x)
      if (is.null(by)) {
        DT <- data.table("x" = x, "id" = id, key = "id")
        return(DT[!is.na(x)][DT[, list(id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
      } else {
        DT <- data.table("x" = x, "by" = by, "id" = id, key = c("by", "id"))
        return(DT[!is.na(x)][DT[, list(by, id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
      }
    }

And then write

    setkey(DT, id, date)
    DT[, value_filled_in := fill_na(value, by = id)]

This is not really satisfying since one would like to write

    setkey(DT, id, date)
    DT[, value_filled_in := fill_na(value), by = id]

However, this takes a huge amount of time to run. And, for the end user, it is cumbersome to learn that `fill_na` should be called with its own `by` argument and should not be used with data.table's `by`. Is there an elegant solution around this?

Some speed tests:

    N <- 2e6
    set.seed(1)
    DT <- data.table(
      date   = sample(10, N, TRUE),
      id     = sample(1e5, N, TRUE),
      value  = sample(c(NA, 1:5), N, TRUE),
      value2 = sample(c(NA, 1:5), N, TRUE)
    )
    setkey(DT, id, date)
    DT <- unique(DT)

    system.time(DT[, filled0 := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]])
    #>    user  system elapsed 
    #>   0.086   0.006   0.105 
    system.time(DT[, filled1 := zoo::na.locf.default(value, na.rm = FALSE), by = id])
    #>    user  system elapsed 
    #>   5.235   0.016   5.274 
    # (slower, and no built-in options like roll = integer, roll = "nearest", rollends, etc.)
    system.time(DT[, filled2 := fill_na(value, by = id)])
    #>    user  system elapsed 
    #>   0.194   0.019   0.221 
    system.time(DT[, filled3 := fill_na(value), by = id])
    #>    user  system elapsed 
    #> 237.256   0.913 238.405 

Why don't I just use `na.locf.default`? Even though the speed difference is not that important here, the same issue arises for other kinds of data.table commands (those that rely on a merge by the variable in `by`) - it's a shame to systematically avoid them just to get an easier syntax. I also really like all the `roll` options.

Matthew
  • How does the `na.locf` solution compare to this solution in terms of speed? – GSee Oct 03 '14 at 01:23
  • Is wrapping the entire thing (a la `dplyr::mutate`) not an option? – shadowtalker Oct 03 '14 at 01:26
  • It would be helpful if you provided code to [create a sample data.table](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that we could use to check our results and to help with benchmarking. – GSee Oct 03 '14 at 01:31
  • @ssdecontrol what do you mean? – Matthew Oct 03 '14 at 02:01
  • If you get rid of the `::` call in the zoo one, it's about 30% faster for me, i.e. call `na.locf.default` instead of `zoo::na.locf.default` – GSee Oct 03 '14 at 02:02
  • @GSee out of curiosity, why would that be the case? – shadowtalker Oct 03 '14 at 02:26
  • @ssdecontrol Because `::` is a function and there is overhead associated with that extra function call. – GSee Oct 03 '14 at 02:28
  • @Matthew I mean, instead of writing a function you can call from inside `[.data.table`, what about a wrapper for the entire operation? But I realize now that a wrapper will probably end up making copies of `DT` (which defeats the purpose of a data.table). – shadowtalker Oct 03 '14 at 02:29
  • @Matthew, it'd be great to be able to roll missing values directly in `data.table`. Could you please file an issue? Just linking back to this SO post is sufficient. – Arun Oct 03 '14 at 06:30

2 Answers


There is now a native data.table way of filling missing values (as of 1.12.4).

This question spawned a github issue, which was recently closed with the creation of the functions `nafill` and `setnafill`. You can now use

    DT[, value_filled_in := nafill(value, type = "locf")]

It is also possible to fill NA with a constant value, or with the next observation carried back.
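For illustration, a small sketch of the three fill types on a toy vector (API as of data.table 1.12.4; the data are made up):

```r
library(data.table)  # >= 1.12.4

x <- c(1, NA, NA, 4)
nafill(x, type = "locf")             # last observation carried forward: 1 1 1 4
nafill(x, type = "nocb")             # next observation carried back:    1 4 4 4
nafill(x, type = "const", fill = 0)  # constant fill:                    1 0 0 4

# And within groups, directly in a data.table:
DT <- data.table(id = c(1, 1, 2, 2), value = c(3, NA, NA, 7))
DT[, value_filled_in := nafill(value, type = "locf"), by = id]
```

`setnafill` does the same by reference, without allocating a new vector.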

One difference from the approach in the question is that these functions currently work only on NA, not NaN, whereas `is.na` is TRUE for NaN; this is planned to be fixed in the next release through an extra argument.

I have no involvement with the project, but I saw that although the github issue links here, there was no link the other way, so I'm answering on behalf of future visitors.

Update: by default, NaN is now treated the same as NA.

anotherfred

Here's a slightly faster and more compact way of doing it (version 1.9.3+):

    DT[, filled4 := DT[!is.na(value)][DT, value, roll = TRUE]]
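A self-contained sketch of this one-liner on a toy keyed table (data made up; `filled4` named as in the benchmarks above):

```r
library(data.table)

DT <- data.table(id = c(1, 1, 2, 2), date = c(1, 2, 1, 2), value = c(3, NA, NA, 7))
setkey(DT, id, date)

# The pure row subset DT[!is.na(value)] preserves row order, so it keeps
# the (id, date) key and the join rolls within each id
DT[, filled4 := DT[!is.na(value)][DT, value, roll = TRUE]]

DT$filled4
# 3 3 NA 7
```

The saving over the question's version is that no `j` expression is needed on either side of the join, so no intermediate column subsets are built.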
eddi
  • My mistake! I thought that this would copy the whole table twice (once through `DT[!is.na(value)]`, the other through `X[Y]`), which would be problematic for a typical wide dataset. Is that not the case (at least for `DT[!is.na(value)]`)? – Matthew Oct 03 '14 at 16:57
  • Ok. Subsetting columns in Y does not change anything. However, it seems `DT[, filled4 := DT[!is.na(value), list(date, id, value)][DT, value, roll = T]]` is faster than your answer on a wide dataset – Matthew Oct 03 '14 at 17:03
  • Hi @eddi. When I try your answer with `data.table 1.12.3` and OP's sample data it errors (`Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :`). Do you know why? Cheers – Henrik Apr 19 '19 at 19:05