R: Fill empty cell with value of last non-empty cell

Question

In Excel, it is easy to grab a cell within a column and drag the cursor downward to replace many cells below so that each cell becomes the same value as the original.

The same thing can be done in R with a for loop. I spent some time trying to figure it out today, and thought I'd share for the benefit of the next person in my shoes:

for (row in 2:length(data$column)){ # start at 2: row 1 has no previous row to copy from
    if(data$column[row] == "") {    # if it's empty...
        data$column[row] = data$column[row-1] # ...replace with previous row's value
    }
}

This worked for me, although it took a long time (5-10 mins) to run with a huge data file. Perhaps there is a more efficient way of achieving this function, and I encourage anyone to say how that could be done.

Thanks and good luck.
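
On the speed concern above: a likely cause is that every `data$column[row] = ...` assignment modifies the data frame itself inside the loop. A minimal sketch, assuming the column is a character vector, is to copy the column out, loop over the plain vector, and write it back once. The toy `data` below is made up for illustration, and the `is.na()` check is an added guard that is not part of the original loop.

```r
# Toy data (hypothetical), standing in for the real file
data <- data.frame(column = c("a", "", "", "b", ""), stringsAsFactors = FALSE)

col <- data$column                      # work on a plain vector, not the data frame
for (row in seq_along(col)[-1]) {       # skip row 1: it has no previous row
  if (!is.na(col[row]) && col[row] == "") {
    col[row] <- col[row - 1]            # copy the previous row's value down
  }
}
data$column <- col                      # write the filled column back once
```

The logic is the same as the original loop; only the place where the assignments happen changes.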

`library(zoo)` `na.locf()` is faster I believe. – thelatemail, Jul 20 '16 at 00:04
Not a question, so does it belong as one? – Jul 20 '16 at 00:05

Sathish · Answer 1 · 2016-07-20T01:30:06.927

df <- data.frame(a = c(1:5, "", 3, "", "", "", 4), stringsAsFactors = FALSE)

> df
   a
1  1
2  2
3  3
4  4
5  5
6   
7  3
8   
9   
10  
11 4

while(length(ind <- which(df$a == "")) > 0){
  df$a[ind] <- df$a[ind -1]
}

> df
   a
1  1
2  2
3  3
4  4
5  5
6  5
7  3
8  3
9  3
10 3
11 4

EDIT: added time profile

set.seed(1)
N = 1e6
df <- data.frame(a = sample(c("",1,2),size=N,replace=TRUE),
                 stringsAsFactors = FALSE)

if(df$a[1] == "") {df$a[1] <- NA}

system.time(
  while(length(ind <- which(df$a == "")) > 0){
    df$a[ind] <- df$a[ind - 1]
  }, gcFirst = TRUE)

user  system elapsed 
0.89    0.00    0.88
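
For comparison, the same fill can be sketched in base R without any loop, by computing for each position the index of the last non-empty element at or before it. `fill_down` is a hypothetical helper name, not something from the answer above. It assumes that both `""` and `NA` count as empty, and it leaves leading empties untouched instead of requiring the first element to be set by hand.

```r
# Hypothetical helper: carry the last non-empty value forward
fill_down <- function(x, empty = "") {
  filled <- !is.na(x) & x != empty          # TRUE where a real value is present
  idx <- cummax(seq_along(x) * filled)      # index of last real value so far (0 if none yet)
  has_prev <- idx > 0                       # positions that have something to copy from
  out <- x
  out[has_prev] <- x[idx[has_prev]]
  out
}

a <- c(1:5, "", 3, "", "", "", 4)           # same toy vector as in the answer
fill_down(a)
```

On the toy vector this gives the same values as the while loop shown earlier in this answer.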

Your while loop is a truly beautiful solution in the way it takes advantage of R's vectorization. — 3D0G, May 08 '18 at 15:58

agstudy · Answer 2 · 2016-07-20T00:17:16.133

Here is a fast solution using na.locf from the zoo package, applied within data.table. I created a new column y in the result to better show the effect of replacing the missing values (it is easy to replace the x column instead). Since na.locf only fills NA values, an extra step is needed first to replace all zero-length strings with NA. The solution is fast: for 1e6 rows the run below took 0.59 s of user time (1.78 s elapsed) on my machine.

library(data.table)
library(zoo)
N=1e6  ##  number of rows 
DT <- data.table(x=sample(c("",1,2),size=N,replace=TRUE))
system.time(DT[!nzchar(x),x:=NA][,y:=na.locf(x)])
## user  system elapsed 
## 0.59    0.30    1.78 
#           x y
#       1:  2 2
#       2: NA 2
#       3: NA 2
#       4:  1 1
#       5:  1 1
#      ---
#  999996:  1 1
#  999997:  2 2
#  999998:  2 2
#  999999: NA 2
# 1000000: NA 2

edited Jul 20 '16 at 00:17

answered Jul 20 '16 at 00:12


A minor issue. If there are actual `NA` values in the data, this will replace them as well. As per `dt <- data.table(x=c(1,NA,2,"",""))` for instance. – thelatemail Jul 20 '16 at 00:43
@thelatemail good catch even if `NA` and `""` are in the same spirit of missing values! I wait for the user example and expected result before going further with this answer. – agstudy Jul 20 '16 at 00:52
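
One way to handle the case raised in these comments is to remember which positions were empty strings and only overwrite those, so genuine `NA` values stay as they are. This is only a sketch under an assumption about the expected result: an empty cell that follows a genuine `NA` takes the last non-missing value before it. It uses a plain vector instead of a data.table to keep it short.

```r
library(zoo)

x <- c("1", NA, "2", "", "")              # example from the comment above

empty <- !is.na(x) & !nzchar(x)           # positions holding "" (genuine NAs excluded)
tmp <- x
tmp[empty] <- NA                          # mark only the empty strings as missing
carried <- na.locf(tmp, na.rm = FALSE)    # carry the last observed value forward

result <- x
result[empty] <- carried[empty]           # fill the "" positions only
result
```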

score 4 · Answer 3 · answered Jun 18 '19 at 12:40

Borrowing agstudy's MWE:

library(dplyr)
library(zoo)

N = 1e6
df <- data.frame(x = sample(c(NA,"A","B"), size=N, replace=TRUE))

system.time(test <- df %>% dplyr::do(zoo::na.locf(.)))

   user  system elapsed 
  0.082   0.000   0.130
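
One thing to watch with `na.locf()`: by default (`na.rm = TRUE`) it drops leading `NA`s, so the result can be shorter than the input when the data starts with a missing value. A small sketch on a made-up vector shows the difference; passing `na.rm = FALSE` keeps the original length.

```r
library(zoo)

x <- c(NA, "A", NA, "B", NA)           # toy vector starting with a missing value

dropped <- na.locf(x)                  # default: the leading NA is removed
kept <- na.locf(x, na.rm = FALSE)      # leading NA is kept, length is unchanged

length(dropped)
length(kept)
```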

dnavinci


Cory Overton · Answer 4 · 2022-05-18T19:57:21.060

Just to provide a more recent update:

tidyr::fill() is very fast, taking about 0.02 s elapsed for 1e6 rows in the timing below.

library(dplyr)
library(tidyr)

N = 1e6
df <- data.frame(x = sample(c(NA,"A","B"), size=N, replace=TRUE))

system.time(test <- df %>% tidyr::fill(x))

   user  system elapsed 
   0.01    0.00    0.02
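
Note that `fill()` fills `NA` values, not empty strings, so for the `""` cells in the original question a conversion step is needed first. A small sketch, assuming dplyr's `na_if()` for that conversion; the toy data frame is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical) with empty strings instead of NA
df <- data.frame(x = c("A", "", "", "B", ""), stringsAsFactors = FALSE)

filled <- df %>%
  mutate(x = na_if(x, "")) %>%       # turn "" into NA so fill() can see it
  fill(x, .direction = "down")       # carry the last non-missing value downward

filled
```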

edited May 18 '22 at 19:57

answered May 18 '22 at 19:55

