I have a dataset where I want to replace NAs with the preceding character string:

d <- data.frame(X = c("one", NA, "two", NA, "three", NA), Y = c(1:6),
                stringsAsFactors = FALSE)
> d
      X Y
1   one 1
2  <NA> 2
3   two 3
4  <NA> 4
5 three 5
6  <NA> 6

I came up with the following solution, which works but feels clumsy:

v <- c()

for (i in seq_along(1:nrow(d))){
  v[i] <- ifelse(is.na(d$X[i]) == TRUE, d$X[i-1], d$X[i])
}

d$X2 <- v    
d
      X Y    X2
1   one 1   one
2  <NA> 2   one
3   two 3   two
4  <NA> 4   two
5 three 5 three
6  <NA> 6 three

My question: Is there a better way to do this, and how could it be implemented in a dplyr pipe?

Stefan
  • You can create a column that is a lag of the X column using the dplyr lag and then you can use an ifelse and don't have to loop over it! – cody_stinson Mar 13 '19 at 21:27
  • @d.b The documentation for `zoo::na.locf` gives a slightly more simplified version of this: `ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])` – Ritchie Sacramento Mar 13 '19 at 21:46
  • Possible duplicate of [Replacing NAs with latest non-NA value](https://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value) and [Last Observation Carried Forward In a data frame?](https://stackoverflow.com/questions/2776135/last-observation-carried-forward-in-a-data-frame) – Ritchie Sacramento Mar 13 '19 at 21:49
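
The `ave()`/`cumsum()` idiom quoted in the comment above can be sketched in base R roughly as follows. This is a minimal illustration on the question's data, not a substitute for `zoo::na.locf`; the intermediate name `grp` is introduced here only for readability.

```r
# Data from the question
d <- data.frame(X = c("one", NA, "two", NA, "three", NA), Y = 1:6,
                stringsAsFactors = FALSE)

# cumsum(!is.na(x)) increments at every non-missing value, so each
# non-NA entry and the NAs that follow it share one group id
grp <- cumsum(!is.na(d$X))

# Within each group, repeat the first element (the non-NA value)
d$X2 <- ave(d$X, grp, FUN = function(x) x[1])
d$X2
```

A leading NA has no earlier value to copy, so it ends up in its own group and stays NA.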

2 Answers

tidyr has a function, fill(), that replaces each NA with the closest non-missing value above it.

If you're fine filling in values in X in place:

library(dplyr)
library(tidyr)

d %>%
  fill(X)
#>       X Y
#> 1   one 1
#> 2   one 2
#> 3   two 3
#> 4   two 4
#> 5 three 5
#> 6 three 6

Or if you need to keep the original X with its missing values, copy it over to another column, and fill that one in:

d %>%
  mutate(X2 = X) %>%
  fill(X2)
#>       X Y    X2
#> 1   one 1   one
#> 2  <NA> 2   one
#> 3   two 3   two
#> 4  <NA> 4   two
#> 5 three 5 three
#> 6  <NA> 6 three
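
To illustrate how `fill()` behaves with runs of NAs and with a leading NA, here is a small sketch. The data frame `d2` is made up for this example and is not part of the question; `.direction` is an argument of `tidyr::fill()` whose default is `"down"`.

```r
library(tidyr)

# Hypothetical data with a leading NA and two consecutive NAs
d2 <- data.frame(X = c(NA, "one", NA, NA, "two"), Y = 1:5,
                 stringsAsFactors = FALSE)

# Default direction "down": each NA takes the last non-missing value above it,
# so the leading NA (which has nothing above it) is left alone
down <- fill(d2, X)

# "downup" fills downward first, then fills remaining leading NAs upward
both <- fill(d2, X, .direction = "downup")
```

Whether filling upward makes sense depends on the data; for a strict "carry the previous value forward" rule, the default direction is probably what you want.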
camille

How about this one? It simplifies your loop using the apply family. If you want to create a new column:

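d$X2 <- vapply(seq_len(nrow(d)), function(i) {
  # Use the previous row's value when the current one is missing;
  # row 1 has no predecessor, so it is kept as-is
  if (i > 1 && is.na(d$X[i])) d$X[i - 1] else d$X[i]
}, character(1))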

If you just want to fill the original column in place:

d$X <- vapply(seq_len(nrow(d)), function(i) {
  # Same lookup as above, written back to X itself
  if (i > 1 && is.na(d$X[i])) d$X[i - 1] else d$X[i]
}, character(1))
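
Note that the row-by-row lookup above reads the previous value from the original column, so with two NAs in a row the second one would find only another NA. A minimal sketch of a variant that carries the last filled value forward might look like this; the helper name `fill_down` and the sample vector are hypothetical and used only for illustration.

```r
# Hypothetical helper: carry the last non-missing value forward
fill_down <- function(x) {
  # Start at the second element; the first has no predecessor
  for (i in seq_along(x)[-1]) {
    # x[i - 1] has already been filled by the previous iteration
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

# Sample vector with two consecutive NAs
x <- c("one", NA, NA, "two", NA)
filled <- fill_down(x)
filled
```

It could then be applied to the question's data with `d$X2 <- fill_down(d$X)`.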
LocoGris