13

I wish to implement a "Last Observation Carried Forward" for a data set I am working on which has missing values at the end of it.

Here is some simple code to do it (my question follows):

LOCF <- function(x)
{
    # Last Observation Carried Forward (for a left to right series)
    LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
    x[LOCF:length(x)] <- x[LOCF]
    return(x)
}


# example:
LOCF(c(1,2,3,4,NA,NA))
LOCF(c(1,NA,3,4,NA,NA))

Now this works great for simple vectors. But if I were to try to use it on a data frame:

a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
a
t(apply(a, 1, LOCF)) # will make a mess

It will turn my data frame into a character matrix.
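The coercion happens because `apply()` first converts a data frame with `as.matrix()`, and a matrix can hold only one type, so mixing a character column with numeric columns makes everything character. A small sketch of that behaviour (the column names are made up for illustration and are not in the original data frame):

```r
# Hypothetical column names; same shape as the data frame in the question.
a <- data.frame(id = rep("a", 4), x = 1:4, y = 1:4, z = c(1, NA, NA, NA))

m <- as.matrix(a)   # the conversion apply() performs on a data frame
is.character(m)     # TRUE: every cell is now a string
sapply(a, class)    # the data frame itself still has mixed column types
```

Restricting the row-wise step to the numeric columns avoids the problem, because the character column never enters the matrix.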

Can you think of a way to do LOCF on a data.frame without turning it into a matrix? (I could use loops and such to correct the mess, but I would love a more elegant solution.)

M--
Tal Galili

7 Answers

23

This already exists:

library(zoo)
na.locf(data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA)))
Shane
  • +1 and rseek.org of course immediately hits this as first results. – Dirk Eddelbuettel May 05 '10 at 19:34
  • My bad for not rseeking it - thanks Shane. But I am afraid it doesn't do the job. (it fills column 3, instead of each row) – Tal Galili May 05 '10 at 19:45
  • You could have also found this if you searched stackoverflow.com for `[r] locf`. – Shane May 05 '10 at 19:47
  • Hi Shane, I also wasn't able to find solution in that search (Although this thread is nice: http://stackoverflow.com/questions/1782704/propagating-data-within-a-vector/1783275#1783275 ) – Tal Galili May 05 '10 at 19:53
  • Look at the accepted answer to that thread. That's what I was referring to. I don't think this question is a duplicate because the other questioner was asking about vectors and you're asking about data frames, but they're very closely related (and the answer is the same). – Shane May 05 '10 at 20:06
  • Hi Shane, the function can be used like this: t(na.locf(t(data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))))) But it will not "solve" the question, since I would need to go through the resulting "matrix" and turn it back to a data.frame. And thanks for taking the time to help :) Tal – Tal Galili May 05 '10 at 20:38
  • Oh...you want to carry column values "forward"? That isn't usually what people do. An "observation" is a row value in R, so LOCF means carry row values downward. You're carrying values across columns. I can't even imagine a circumstance in which one would do that? – Shane May 05 '10 at 20:50
  • Hi Shane, it's very simple. I have a wide (instead of long) data.frame. I can turn it to long and then use a function from the other SO thread. The only problem with that would be the case of the first value being missing... – Tal Galili May 05 '10 at 20:59
  • If the first value is missing, then you can make a judgement about what to do to handle it. No function will solve that problem for you. You will need to either leave the whole thing as missing, or set a default first value (like zero, for instance). – Shane May 05 '10 at 21:01
  • I don't see why turning the matrix back to a data.frame with `data.frame(t(na.locf(t(dat))))` should be a problem. And following `na.locf(dat)` with `na.locf(dat, fromLast = TRUE)` should carry next observations backward (NOCB) and fill first missing values. No? So: `data.frame(t(na.locf(na.locf(t(dat)),fromLast=T)))` –  Oct 18 '14 at 11:48
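The "carry forward, then carry backward" idea from the last comment can be sketched in base R without zoo. This is not zoo's implementation; `locf` and `nocb` are hypothetical helper names used only for illustration:

```r
# Hypothetical helper: carry the last observation forward within a vector.
locf <- function(x) {
  seen <- !is.na(x)
  c(NA, x[seen])[cumsum(seen) + 1]
}

# Hypothetical helper: next observation carried backward
# (the same operation applied to the reversed vector).
nocb <- function(x) rev(locf(rev(x)))

x <- c(NA, 2, NA, 4, NA)
locf(x)        # NA  2  2  4  4  (the leading NA is left alone)
nocb(locf(x))  #  2  2  2  4  4  (the leading NA is filled from the next value)
```

Whether filling a leading NA from a later value is acceptable is a judgement call for the data at hand, as noted in the comments.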
11

If you do not want to load a big package like zoo just for the na.locf function, here is a short solution which also works if there are some leading NAs in the input vector.

na.locf <- function(x) {
  v <- !is.na(x)
  c(NA, x[v])[cumsum(v)+1]
}
Henrik Seidel
  • I like this solution best. If you want to apply it to a `data.frame` like in the original question, you can use it via `a[]=lapply(a,na.locf)`. – cryo111 Dec 14 '17 at 14:10
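To make the comment above concrete, here is a hedged sketch that applies the answer's helper column by column, and also row by row as the question asks for. The helper is renamed `locf_vec` (a hypothetical name) so it cannot mask `zoo::na.locf`, and the column names are invented for illustration:

```r
# The answer's helper under a hypothetical name.
locf_vec <- function(x) {
  v <- !is.na(x)
  c(NA, x[v])[cumsum(v) + 1]
}

# Hypothetical column names for the question's wide data frame.
a <- data.frame(id = rep("a", 4), t1 = 1:4, t2 = 1:4, t3 = c(1, NA, NA, NA))

# Column-wise: assigning into a[] keeps the data.frame class and column names.
by_col <- a
by_col[] <- lapply(by_col, locf_vec)

# Row-wise: only the numeric columns go through a matrix, so `id` is never coerced.
# Note that integer columns come back as double after the matrix round trip.
num <- vapply(a, is.numeric, logical(1))
by_row <- a
by_row[num] <- as.data.frame(t(apply(as.matrix(a[num]), 1, locf_vec)))
```

In `by_col` the NAs in `t3` are filled from the row above, while in `by_row` they are filled from the column to the left.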
10

Adding the new tidyr::fill() function for carrying forward the last observation in a column to fill in NAs:

a <- data.frame(col1 = rep("a",4), col2 = 1:4, 
                col3 = 1:4, col4 = c(1,NA,NA,NA))
a
#   col1 col2 col3 col4
# 1    a    1    1    1
# 2    a    2    2   NA
# 3    a    3    3   NA
# 4    a    4    4   NA

library(tidyr)  # provides fill() and re-exports the %>% pipe
a %>% fill(col4)
#   col1 col2 col3 col4
# 1    a    1    1    1
# 2    a    2    2    1
# 3    a    3    3    1
# 4    a    4    4    1
Prradep
6

Several packages implement exactly this functionality. They share the same basic behaviour but differ in their additional options:

  • spacetime::na.locf
  • imputeTS::na_locf
  • zoo::na.locf
  • xts::na.locf
  • tidyr::fill

Added a benchmark of these methods for @Alex:

I used the microbenchmark package and the tsNH4 time series, which has 4552 observations. These are the results: [image: benchmark results]

So for this case na_locf from imputeTS was the fastest, closely followed by na.locf0 from zoo. The other methods were significantly slower. Be careful, though: this is a benchmark of only one specific time series. (I added the code so that you can test your own use case.)

Results as a plot: [image: benchmark plot]

Here is the code, if you want to recreate the benchmark with a time series of your own choosing:

library(microbenchmark)
library(imputeTS)
library(zoo)
library(xts)
library(spacetime)
library(tidyr)

# Create a data.frame from the tsNH4 series
df <- as.data.frame(tsNH4)

res <- microbenchmark(imputeTS::na_locf(tsNH4),
                    zoo::na.locf0(tsNH4),
                    zoo::na.locf(tsNH4), 
                    tidyr::fill(df, everything()), 
                    spacetime::na.locf(tsNH4), 
                    times = 100)
ggplot2::autoplot(res)

plot(res)

# code just to show that each method produces correct output
spacetime::na.locf(tsNH4)
imputeTS::na_locf(tsNH4)
zoo::na.locf(tsNH4)
zoo::na.locf0(tsNH4)
tidyr::fill(df, everything())
Steffen Moritz
2

This question is old, but for posterity: the best solution is to use the data.table package with roll = TRUE in a join.
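The answer gives no code, so here is a minimal sketch of what a rolling join could look like, assuming long-format data with a time column. The tables `obs` and `grid` and their columns are hypothetical and chosen only for illustration:

```r
library(data.table)

# Hypothetical data: a value is recorded only at times 1 and 4.
obs <- data.table(time = c(1L, 4L), value = c(10, 40))

# Every time point we want a value for.
grid <- data.table(time = 1:6)

# Rolling join: for each time in `grid`, take the most recent row of `obs`
# at or before that time, which carries the last observation forward.
filled <- obs[grid, on = "time", roll = TRUE]
filled
```

This works on long data, so a wide data frame like the one in the question would first need to be reshaped.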

Dave31415
0

I ended up solving this using a loop:

fillInTheBlanks <- function(S) {
  L <- !is.na(S)
  c(S[L][1], S[L])[cumsum(L)+1]
}


LOCF.DF <- function(xx)
{
    # won't work well if the first observation is NA

    orig.class <- lapply(xx, class)

    new.xx <- data.frame(t( apply(xx,1, fillInTheBlanks) ))

    for(i in seq_along(orig.class))
    {
        if(orig.class[[i]] == "factor") new.xx[,i] <- as.factor(new.xx[,i])
        if(orig.class[[i]] == "numeric") new.xx[,i] <- as.numeric(new.xx[,i])
        if(orig.class[[i]] == "integer") new.xx[,i] <- as.integer(new.xx[,i])   
    }

    #t(na.locf(t(a)))

    return(new.xx)
}

a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
LOCF.DF(a)
Tal Galili
0

Instead of apply() you can use lapply() and then convert the resulting list back to a data.frame. lapply() works column by column, so each column keeps its own type.

LOCF <- function(x) {
    # Last Observation Carried Forward (for a left to right series)
    LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
    x[LOCF:length(x)] <- x[LOCF]
    return(x)
}

a <- data.frame(rep("a",4), 1:4, 1:4, c(1, NA, NA, NA))
a
data.frame(lapply(a, LOCF))
djhurio