I use the following code to replace each NA in the variable x with the preceding value of x (the value at position i-1). It works, but it is too slow because the dataset is large.

for (i in 2:nrow(data_final)) {
  data_final$COD_ATC5[i] <- ifelse(is.na(data_final$COD_ATC5[i]), data_final$COD_ATC5[i-1], data_final$COD_ATC5[i])
}

Is there a faster way to do this?

Here is a reproducible example of the dataset:

data_final <- data.frame(ID=c(rep("01",12),rep("02",12)), t = rep(1:12,2), x= c(rep("A",4),NA,rep("A",3),rep("C",4),rep("A",5),rep("C",3),NA,"C",rep("A",2)))
jeff

1 Answer

We can first find the indices of the NA values with which() and then replace only those positions with the values at idx - 1. The function ByWhich shows how this works.

# Sample data
data_final <- data.frame(ID=c(rep("01",12),rep("02",12)), t = rep(1:12,2), x= c(rep("A",3), "B", NA, rep("A",3),rep("C",4),rep("A",5),rep("C",3),NA,"C",rep("A",2)))

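# New solution
# Note: this copies only one step back, so it assumes x[1] is not NA
# and that no two NAs are adjacent (true for the sample data).
ByWhich <- function(x) {
  idx <- which(is.na(x))  # positions of the NAs
  x[idx] <- x[idx - 1]    # overwrite each NA with the preceding value
  return(x)
}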

# Solution by asker
ByLoop <- function(x) {
  for (i in 2:length(x)) {
    x[i] <- ifelse(is.na(x[i]), x[i-1], x[i])
  }
  return(x)
}

# Test if the functions provide equal solutions
all(ByLoop(data_final$x) == ByWhich(data_final$x))
#> [1] TRUE
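
The two functions agree on the sample data because it contains no consecutive NAs. With a run of NAs, the loop carries the last known value forward through the whole run, while a single one-step shift would copy an NA into the second position of the run. A minimal base R sketch that handles runs without a loop is shown here; the helper name FillDown is hypothetical and not part of the original answer.

```r
# Vectorised "last observation carried forward" in base R.
# FillDown is a hypothetical helper name used for illustration.
FillDown <- function(x) {
  not_na <- !is.na(x)
  # For each position, the index of the most recent non-NA element
  # (0 where no non-NA value has been seen yet)
  last_seen <- cummax(seq_along(x) * not_na)
  # Positions before the first non-NA value have nothing to copy, so stay NA
  last_seen[last_seen == 0] <- NA
  x[last_seen]
}

FillDown(c("A", NA, NA, "B", NA))
#> [1] "A" "A" "A" "B" "B"
```

Leading NAs are left as NA, since there is no earlier value to carry forward.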

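The benchmark shows that the solution using which is considerably faster: about 15 times faster at the median (roughly 2.4 vs. 37 microseconds). Its mean is only about 40% lower because a single slow run (max 2124.802 microseconds) inflates it.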

library(microbenchmark)
microbenchmark::microbenchmark(
  ByWhich = ByWhich(data_final$x),
  ByLoop  = ByLoop(data_final$x)
)
#> Unit: microseconds
#>     expr    min      lq     mean  median      uq      max neval
#>  ByWhich  2.001  2.1010 23.60294  2.4010  2.5010 2124.802   100
#>   ByLoop 35.400 36.2515 37.16908 37.0005 37.5015   42.301   100

This solution does not require an extra package. However, the zoo or tidyverse solutions provided in the comments are probably even faster.
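
For reference, package-based fills are commonly written along the following lines. This is only a sketch, assuming the zoo and tidyr packages are installed; the vector v and the data frame df are illustrative, and no claim is made here about their speed relative to ByWhich.

```r
# Small example vector with a run of two NAs (illustrative data)
v <- c("A", NA, NA, "B", NA)

# zoo: na.rm = FALSE keeps leading NAs, so the length is unchanged
if (requireNamespace("zoo", quietly = TRUE)) {
  filled_zoo <- zoo::na.locf(v, na.rm = FALSE)
  print(filled_zoo)
}

# tidyr: fill() works on data frame columns; "down" is the default direction
if (requireNamespace("tidyr", quietly = TRUE)) {
  df <- data.frame(t = seq_along(v), x = v)
  filled_df <- tidyr::fill(df, x, .direction = "down")
  print(filled_df$x)
}
```

Both calls carry the last observed value forward through consecutive NAs.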

Created on 2021-05-21 by the reprex package (v2.0.0)

Jan