I am trying to carry the last observation forward, by group, in a large data.table, filling at most 2 consecutive NAs. There are plenty of complex solutions out there, but none of them seems to combine all three elements: an na.locf-like function with a maximum fill limit, applied by group, in data.table.

My data looks like:

df <- structure(list(country = c("USA", "USA", "USA", "USA", "USA", 
"FR", "FR", "FR", "FR", "FR"), values = c(2, 1, NA, NA, NA, 2, 
1, 2, NA, NA)), class = c("data.table", "data.frame"), row.names = c(NA, 
-10L))

      country values
 1:     USA      2
 2:     USA      1
 3:     USA     NA
 4:     USA     NA
 5:     USA     NA
 6:      FR      2
 7:      FR      1
 8:      FR      2
 9:      FR     NA
10:      FR     NA

I want it to look like this:

     country values
1      USA      2
2      USA      1
3      USA      1
4      USA      1
5      USA     NA
6       FR      2
7       FR      1
8       FR      2
9       FR      2
10      FR      2
  • This seems to be exactly what you need: https://stackoverflow.com/questions/38822441/na-locf-fill-nas-up-to-maxgap-even-if-gap-maxgap-with-groups. Just need to change `Price` and `ID` with `values` and `country`, respectively. – Marcelo Avila Jul 17 '21 at 11:32
  • The grouping element using data.table is missing there. None of these seem to provide a solution for me. @MarceloAvila Unfortunately that solution incorporates both backward carrying of the last known observation as well as forward carrying. I already messed with that but I can't get it to work. – Aaron Kaijser Jul 17 '21 at 12:06
  • This seems useful `df$values |> dendextend::na_locf(recursive = F) |> dendextend::na_locf(recursive = F)` – Peace Wang Jul 17 '21 at 12:37
  • Related: [Fill NA in a time series only to a limited number](https://stackoverflow.com/questions/25940241/fill-na-in-a-time-series-only-to-a-limited-number) – Henrik Jul 17 '21 at 19:23

3 Answers

You can use the following helper function and apply it by group:

library(data.table)
library(zoo)

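# Carry the last observation forward, filling at most n consecutive NAs
replace_NA_with_limit <- function(a, n) {
  r <- rle(is.na(a))               # runs of NA / non-NA values
  a <- na.locf(a, na.rm = FALSE)   # fill everything; na.rm = FALSE keeps leading NAs and the length
  # set positions beyond the n-th element of each NA run back to NA
  is.na(a) <- sequence(r$lengths) > n & rep(r$values, r$lengths)
  a
}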

setDT(df)[, values := replace_NA_with_limit(values, 2), country]
df

#    country values
# 1:     USA      2
# 2:     USA      1
# 3:     USA      1
# 4:     USA      1
# 5:     USA     NA
# 6:      FR      2
# 7:      FR      1
# 8:      FR      2
# 9:      FR      2
#10:      FR      2
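
To see why the helper works, it can help to look at the intermediate vectors for a single group. The sketch below uses only base R and the USA values from the question; the variable names `pos`, `in_na_run`, and `too_far` are illustrative and not part of the answer.

```r
a <- c(2, 1, NA, NA, NA)               # USA values from the question
n <- 2                                 # maximum number of NAs to fill per run

r <- rle(is.na(a))                     # lengths 2 3, values FALSE TRUE
pos <- sequence(r$lengths)             # position within each run: 1 2 1 2 3
in_na_run <- rep(r$values, r$lengths)  # FALSE FALSE TRUE TRUE TRUE
too_far <- pos > n & in_na_run         # TRUE only past the n-th NA of a run

too_far
# [1] FALSE FALSE FALSE FALSE  TRUE
```

Only the positions flagged in `too_far` are set back to NA after the full fill, which is what enforces the limit.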
Ronak Shah

Note that, when dealing with longer stretches of NAs, one normally either fills all of them or none of them, and na.locf already handles that through its maxgap argument, which fills only gaps that are no longer than the specified length. The idea is that interpolation is only reliable over short stretches, so you should not interpolate at all over longer ones. Nevertheless, the following shows how to implement the scheme in the question, but consider whether you should change strategy and use maxgap instead.
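
For comparison, here is a minimal sketch of the maxgap strategy applied to the question's data. The table name `dt` and the column name `filled` are illustrative, and the data is rebuilt here so the snippet runs on its own.

```r
library(data.table)
library(zoo)

# Question's data, rebuilt under an illustrative name
dt <- data.table(
  country = rep(c("USA", "FR"), each = 5),
  values  = c(2, 1, NA, NA, NA, 2, 1, 2, NA, NA)
)

# na.rm = FALSE keeps the vector length unchanged within each group;
# maxgap = 2 fills a run of NAs only if the whole run has at most 2 NAs
dt[, filled := na.locf(values, na.rm = FALSE, maxgap = 2), by = country]
dt
```

With this data the USA run of three NAs should stay entirely NA, while the FR run of two NAs is filled, which is the all-or-nothing behaviour described above rather than the partial fill asked for in the question.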

1) Compute the fully filled series with na.locf0, stored as locf, and create a grouping variable g that identifies runs of NAs and non-NAs. Then, for each run, take the first two elements from locf and the remainder from values (which are NA in an NA run). In non-NA runs locf and values are identical, so those rows pass through unchanged. This does not overwrite df, so it can be used in a pipeline without side effects.

library(data.table)
library(zoo)

df[, .(values, locf = na.locf0(values), g = rleid(is.na(values))), by = country][
   , .(values = c(head(locf, 2), tail(values, -2))), by = .(country, g)][
   , .(country, values)]

giving:

    country values
 1:     USA      2
 2:     USA      1
 3:     USA      1
 4:     USA      1
 5:     USA     NA
 6:      FR      2
 7:      FR      1
 8:      FR      2
 9:      FR      2
10:      FR      2
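
To make the first step concrete, the intermediate table can be inspected on its own. This sketch rebuilds the question's data under the illustrative names `dt` and `step1`; `locf` and `g` are the columns described above.

```r
library(data.table)
library(zoo)

# Question's data, rebuilt under an illustrative name
dt <- data.table(
  country = rep(c("USA", "FR"), each = 5),
  values  = c(2, 1, NA, NA, NA, 2, 1, 2, NA, NA)
)

# Step 1 only: the fully filled series and a run id, both computed per country
step1 <- dt[, .(values, locf = na.locf0(values), g = rleid(is.na(values))),
            by = country]
step1
```

Because `g` restarts at 1 in each country, the second step has to group by both `country` and `g` to pick out individual runs.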

2) A slightly modified formulation that uses the same basic idea is the following. It also does not overwrite df.

library(data.table)
library(zoo)

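# like na.locf0, but takes only a vector x and the maximum number of NAs to fill per run, k
na.locf2 <- function(x, k) {
  nalocf <- na.locf0(x)  # fully filled copy of x
  # for each run: first k elements from the filled copy, the rest from x
  f <- function(ix) c(head(nalocf[ix], k), tail(x[ix], -k))
  unlist(tapply(seq_along(x), rleid(is.na(x)), f), use.names = FALSE)
}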
df[, .(values = na.locf2(values, 2)), by = country]

giving:

    country values
 1:     USA      2
 2:     USA      1
 3:     USA      1
 4:     USA      1
 5:     USA     NA
 6:      FR      2
 7:      FR      1
 8:      FR      2
 9:      FR      2
10:      FR      2
G. Grothendieck

Here is another option:

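library(data.table)
# ri numbers the rows within each (country, values) pair, so NAs are counted per country
setDT(df)[, ri := rowid(country, values)]
# fill only on the non-NA rows and on the first two NA rows of each country
df[!is.na(values) | ri <= 2L, values := nafill(values, "locf")]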
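
Note that `rowid(country, values)` counts NAs per country rather than per run, and the `nafill` call is not grouped, so the snippet above relies on each country having a single run of NAs and starting with a non-NA value. If that does not hold, a variant along the following lines may be more robust. It is only a sketch: `locf_limit` and `dt` are hypothetical names introduced here, not part of data.table.

```r
library(data.table)

# Hypothetical helper: LOCF that fills at most k NAs in each run
locf_limit <- function(x, k) {
  filled <- nafill(x, type = "locf")  # fill every NA first (leading NAs stay NA)
  pos <- rowid(rleid(is.na(x)))       # position within each NA / non-NA run
  filled[is.na(x) & pos > k] <- NA    # set anything past the k-th NA back to NA
  filled
}

# Question's data, rebuilt under an illustrative name
dt <- data.table(
  country = rep(c("USA", "FR"), each = 5),
  values  = c(2, 1, NA, NA, NA, 2, 1, 2, NA, NA)
)

# Grouping by country prevents values from leaking across groups
dt[, values := locf_limit(values, 2L), by = country]
dt
```

This needs no helper column and no zoo dependency, at the cost of computing the full fill before trimming it.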
chinsoon12