
Part of a function I am including in an R package involves filling NAs with the last observation carried forward (LOCF). The LOCF should be applied to all columns in the data frame except what I call the good columns, goodcols, below (i.e. it should be applied to the badcols). The column names of the badcols can be anything. I use the locf function below and a for-loop to achieve this. However, the for-loop is a bit slow on large data sets. Can anybody suggest a faster alternative or another way of filling in the NAs in this scenario?

Here is an example data frame:

#Test df
TIME <- c(0,5,10,15,20,25,30,40,50)
AMT  <- c(50,0,0,0,50,0,0,0,0)
COV1 <- c(10,9,NA,NA,5,5,NA,10,NA)
COV2 <- c(20,15,15,NA,NA,10,NA,30,NA)
ID   <- rep(1, times=length(TIME))

df <- data.frame(ID,TIME,AMT,COV1,COV2)
df <- expand.grid(df)  # all combinations of the column values (9^5 rows)

goodcols <- c("ID","TIME","AMT")
badcols <- which(!names(df) %in% goodcols)

#----------------------------------------------------
#locf function
locf <- function (x) {
  good <- !is.na(x)
  positions <- seq_along(x)
  good.positions <- good * positions
  last.good.position <- cummax(good.positions)
  last.good.position[last.good.position == 0] <- NA
  x[last.good.position]
}
#------------------------------------------------------
#Now fill in the gaps by locf function
for (i in badcols)
{
  df[,i] <- locf(df[,i])
}
daragh
  • Did you look at the `na.locf` function from the `zoo` package? – Jaap Oct 03 '16 at 06:05
  • [This Q&A](http://stackoverflow.com/questions/26171958/fill-in-missing-values-by-group-in-data-table) might help if speed is an issue. – Jaap Oct 03 '16 at 06:28
  • @ProcrastinatusMaximus The thing is, in my case I only know what I referred to as `goodcols`. The names of the columns to be imputed are unknown. Therefore, I need something generic that I can use, which is what the for-loop I have does. `na.locf` can be used if I actually know the name of the column to be imputed. – daragh Oct 03 '16 at 06:57

1 Answer


Sorry for writing an answer (not enough reputation to just comment).

But what prevents you from doing as @ProcrastinatusMaximus said? You can put the zoo call inside your loop.

It would look like this:

for (i in badcols)
{
  # na.rm = FALSE keeps leading NAs, so the result has the same length as the column
  df[,i] <- zoo::na.locf(df[,i], na.rm = FALSE)
}
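The loop itself can also be written as a single assignment over all bad columns. The following is only a sketch in base R, using a smaller toy data frame than the one in the question and the question's own `locf` function. Whether it is actually faster than the loop on your data is something you would have to measure.

```r
# Toy data in the shape of the question's example (smaller, for illustration only)
df <- data.frame(
  ID   = 1,
  TIME = c(0, 5, 10, 15, 20),
  AMT  = c(50, 0, 0, 0, 50),
  COV1 = c(10, 9, NA, NA, 5),
  COV2 = c(20, 15, 15, NA, NA)
)

goodcols <- c("ID", "TIME", "AMT")
# Everything that is not a good column gets imputed; the names need not be known
badcols <- setdiff(names(df), goodcols)

# locf function from the question
locf <- function(x) {
  good <- !is.na(x)
  positions <- seq_along(x)
  last.good.position <- cummax(good * positions)
  last.good.position[last.good.position == 0] <- NA
  x[last.good.position]
}

# One assignment for all bad columns instead of one assignment per iteration
df[badcols] <- lapply(df[badcols], locf)
```

Because `badcols` is derived from `goodcols` with `setdiff`, nothing has to be known about the names of the columns being imputed.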

I am not sure whether zoo is faster than your implementation; you would have to benchmark it. You could also check spacetime::na.locf and imputeTS::na.locf to see which of the existing LOCF implementations is fastest.
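One way to try this out is a rough timing comparison with `system.time`. This is a minimal sketch on a simulated vector; the vector size and NA fraction are arbitrary assumptions, and the zoo comparison only runs if zoo is installed. Timings will differ between machines and data sets.

```r
set.seed(1)
n <- 1e5
x <- rnorm(n)
x[sample(n, n / 2)] <- NA  # make roughly half of the values missing (arbitrary choice)
x[1] <- 0                  # keep the first value observed so there are no leading NAs

# locf function from the question
locf <- function(x) {
  good <- !is.na(x)
  positions <- seq_along(x)
  last.good.position <- cummax(good * positions)
  last.good.position[last.good.position == 0] <- NA
  x[last.good.position]
}

t_custom <- system.time(res_custom <- locf(x))
print(t_custom[["elapsed"]])

# Compare against zoo only if the package is available
if (requireNamespace("zoo", quietly = TRUE)) {
  t_zoo <- system.time(res_zoo <- zoo::na.locf(x, na.rm = FALSE))
  # Both implementations should fill the gaps identically
  stopifnot(isTRUE(all.equal(res_custom, res_zoo)))
  print(c(custom = t_custom[["elapsed"]], zoo = t_zoo[["elapsed"]]))
}
```

A single `system.time` call is noisy, so for a decision that matters it is worth repeating the measurement several times on data of realistic size.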

sm925
Steffen Moritz