-1
library(data.table)
set.seed(123)
dt = data.table( grp=round(runif(10)), val=c(runif(4), NA, runif(4), NA) )
dt

Output is:

    grp        val
 1:   0 0.95683335
 2:   1 0.45333416
 3:   0 0.67757064
 4:   1 0.57263340
 5:   1         NA
 6:   0 0.10292468
 7:   1 0.89982497
 8:   1 0.24608773
 9:   1 0.04205953
10:   0         NA

I'd like to fill the val with the previous non-NA value of val.

The SO question "Replacing NAs with latest non-NA value" has an amazing SO answer, which I do not fully comprehend. Nonetheless, I tried:

dt[ , val2 := val[1], .(grp, cumsum(!is.na(val))) ]
dt

Output is:

    grp        val       val2
 1:   0 0.95683335 0.95683335
 2:   1 0.45333416 0.45333416
 3:   0 0.67757064 0.67757064
 4:   1 0.57263340 0.57263340
 5:   1         NA 0.57263340
 6:   0 0.10292468 0.10292468
 7:   1 0.89982497 0.89982497
 8:   1 0.24608773 0.24608773
 9:   1 0.04205953 0.04205953
10:   0         NA         NA

This almost works (it correctly filled in row 5). Why does the 10th row of dt still have an NA value a val2 instead of 0.10292468 (the previous non-NA value for grp == 0)?

TMo
  • 435
  • 4
  • 11

2 Answers2

0

There is a nafill in data.table

library(data.table)
dt[, val2 := nafill(val, type = 'locf')]

-output

dt
#    grp        val       val2
# 1:   0 0.95683335 0.95683335
# 2:   1 0.45333416 0.45333416
# 3:   0 0.67757064 0.67757064
# 4:   1 0.57263340 0.57263340
# 5:   1         NA 0.57263340
# 6:   0 0.10292468 0.10292468
# 7:   1 0.89982497 0.89982497
# 8:   1 0.24608773 0.24608773
# 9:   1 0.04205953 0.04205953
#10:   0         NA 0.04205953

The reason it is NA in the OP's output is based on the grouping

dt[, cumsum(!is.na(val))]
#[1] 1 2 3 4 4 5 6 7 8 8

last two elements are grouped as 8, where as the 'grp' last element is 0, so there is only one element that matches and if we use val[1] which NA and thus we got the NA

It can be more clear if we check the .GRP with by argument

dt[, .GRP, by = .(grp1 = cumsum(!is.na(val)), grp)]
#      grp1 grp GRP
#1:    1   0   1
#2:    2   1   2
#3:    3   0   3
#4:    4   1   4
#5:    5   0   5
#6:    6   1   6
#7:    7   1   7
#8:    8   1   8
#9:    8   0   9  # -> only a single element for group 9
 

Thus, val[1] returns the NA element

However, if we do an order on 'grp' and then create the new group with cumsum, the adjacent elements are different, thus there are more than one element per group

akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! I kind of like the other SO answer I just added a little more, only because it is more general and I can specify any value that I want to replace, not just NA. – TMo Feb 28 '21 at 23:10
  • @TMo I answered the reason why you are getting NA in the edit – akrun Feb 28 '21 at 23:10
  • Thanks! I've edited your answer to make sense to me. There is still one part left unexplained. Once you can explain why val[1] results in NA, then I can mark the response as the best answer. – TMo Feb 28 '21 at 23:52
  • @TMo the reason I specified in the post i.e. the last two grouping from `cumsum` is a single value, but the last value in 'grp' is 0, so there is only a single element for the last NA and thus when you do `val[1]` it is the `NA` that gets replaced. However, if you do an `order(grp)` and then apply the `cumsum` on the new ordered data, the adjacent elements are different now, creating a new grouping – akrun Feb 28 '21 at 23:53
0

Weird. I just noticed that the SO answer I referred to first ordered dt by grp. I tried that and it worked.

dt = data.table( grp=round(runif(10)), val=c(runif(4), NA, runif(4), NA) )
dt = dt[ order(grp) , ] ## seems to be critical step
dt[ , val2 := val[1], .(grp, cumsum(!is.na(val))) ]
dt

Output is:

    grp        val       val2
 1:   0 0.95683335 0.95683335
 2:   0 0.67757064 0.67757064
 3:   0 0.10292468 0.10292468
 4:   0         NA 0.10292468
 5:   1 0.45333416 0.45333416
 6:   1 0.57263340 0.57263340
 7:   1         NA 0.57263340
 8:   1 0.89982497 0.89982497
 9:   1 0.24608773 0.24608773
10:   1 0.04205953 0.04205953
TMo
  • 435
  • 4
  • 11