Replacing or imputing NA values in R without For Loop

Question

Is there a better way to go through the observations in a data frame and impute NA values? I've put together a for loop that seems to do the job, replacing each NA with its row's mean, but I'm wondering whether there is an approach that avoids the loop, perhaps a built-in R function?

# 1. Create data frame with some NA values. 

rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)  
df2 <- df

# 2. Run for loop to replace NAs with that row's mean.

for (i in 1:3) {                # for every row
  x <- as.numeric(df[i, ])      # extract that row as a numeric vector
  y <- is.na(x)                 # logical vector marking the NAs
  z <- !is.na(x)                # logical vector marking the non-NAs
  result <- mean(x[z])          # mean of the row's non-NA values
  df2[i, y] <- result           # replace the NAs in that row
}

# 3. Show output with imputed row mean values.

print(df)  # before
print(df2) # after

You should always use `set.seed` when you provide data that relies on random number generation. – mlegge, Aug 12 '15 at 20:52
@akrun, nice find. It seems the answer there is exactly the same as mine. Oh well, great minds think alike I guess :) – David Arenburg, Aug 12 '15 at 21:09
@akrun imo, this question is not that identical... no answer was accepted on the other question by the OP. ;) I do think it helps others learn by seeing different ways of approaching and asking a related question, especially in R. The answer explanations and the structure of this question, I believe, have some value. – Bridgbro, Aug 13 '15 at 00:33
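
As a rough illustration of mlegge's point about `set.seed`, the sketch below wraps the question's data-generation steps in a helper and calls it twice with the same seed. The helper name `make_data` and the seed value 42 are assumptions made for this example; they do not appear in the thread.

```r
# Hypothetical helper: builds the question's data frame from a given seed.
make_data <- function(seed) {
  set.seed(seed)                        # fix the random number stream
  rdata <- rbinom(30, 5, prob = 0.5)
  rdata[rdata == 0] <- NA               # turn zeros into NAs, as in the question
  as.data.frame(matrix(rdata, 3, 10))
}

df_a <- make_data(42)
df_b <- make_data(42)

# Same seed, so both data frames have the same values and NA positions.
identical(df_a, df_b)
```

With a seed in place, anyone answering works with the same NA positions as the asker.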

score 6 · Accepted Answer · answered Aug 12 '15 at 21:04

Here's a possible vectorized approach (without any loop):

indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]

Some explanation

We can identify the locations of the NAs using the `arr.ind` argument of `which`, which returns a two-column matrix of row and column indexes. We can then index `df` by that matrix (rows and columns), index the row means by the row indexes alone, and replace the values accordingly.
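
To make the indexing concrete, here is a minimal sketch of the same two steps on a small hand-made data frame. The values are hypothetical and chosen so the row means are easy to check by hand.

```r
# Hypothetical data frame with one NA per row.
df <- data.frame(V1 = c(1, NA, 4),
                 V2 = c(NA, 2, 5),
                 V3 = c(3, 4, NA))

# Each row of indx is a (row, col) pair locating one NA.
indx <- which(is.na(df), arr.ind = TRUE)

# Row means ignoring NAs: 2, 3 and 4.5.
rmeans <- rowMeans(df, na.rm = TRUE)

# For every NA, look up the mean of its row and assign it by matrix index.
df[indx] <- rmeans[indx[, "row"]]
df
```

Row 1 holds 1 and 3, so its NA becomes 2; row 2 holds 2 and 4, so its NA becomes 3; row 3 holds 4 and 5, so its NA becomes 4.5.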

Rorschach · Answer 2 · 2015-08-12T21:18:26.007

score 3

One possibility is `impute` from Hmisc, which lets you choose any function to do the imputation:

library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))

Also, you can hide the loop in an `apply`:

t(apply(df2, 1, function(x) {
    mu <- mean(x, na.rm = TRUE)   # row mean, ignoring NAs
    x[is.na(x)] <- mu             # fill that row's NAs
    x
}))
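
One detail worth noting about the `apply` version is that it returns a matrix rather than a data frame. The sketch below, on hypothetical data, shows this and one way to write the result back into a data frame. The names `fill_row`, `filled` and `df2` are chosen for this example only.

```r
# Hypothetical data frame with one NA per row.
df <- data.frame(V1 = c(1, NA, 4),
                 V2 = c(NA, 2, 5),
                 V3 = c(3, 4, NA))

# Replace the NAs in one row vector with that vector's mean.
fill_row <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# apply() passes each row to fill_row() and stacks the results as columns,
# so t() is needed to restore the original orientation.
filled <- t(apply(df, 1, fill_row))
is.matrix(filled)   # the result is a matrix, not a data frame

# Assigning into df2[] keeps the data frame class and the column names.
df2 <- df
df2[] <- filled
```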

edited Aug 12 '15 at 21:18

answered Aug 12 '15 at 20:59


Ben Bolker · Answer 3 · 2015-08-12T21:11:40.367

score 3

Data:

set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)

This is a little trickier than I'd like: it relies on the column-major ordering of matrices in R as well as the recycling of the row-means vector to the full length of the matrix. I tried to come up with a `sweep()` solution but haven't managed so far.

rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
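
The recycling argument can be checked directly. The sketch below uses a hypothetical 3 x 4 matrix and compares the recycled row-means vector with an explicit lookup by row index. `rep_len` and `row` are base R functions, and the data values are made up for illustration.

```r
# Hypothetical 3 x 4 matrix, filled column by column, with a few NAs.
m <- matrix(c(1, NA, 4,
              NA, 2, 5,
              3, 4, NA,
              2, 6, 0), nrow = 3)
rmeans <- rowMeans(m, na.rm = TRUE)

# R stores m column by column, so element k belongs to row ((k - 1) %% 3) + 1.
# Recycling the length-3 rmeans over all 12 elements therefore lines each
# element up with the mean of its own row.
recycled <- rep_len(rmeans, length(m))
all(recycled == rmeans[row(m)])

# ifelse() recycles rmeans in the same way when filling the NAs.
filled <- ifelse(is.na(m), rmeans, m)
```

The alignment holds only because the vector being recycled has one entry per row. A vector of column means recycled the same way would still run down the columns and would not line up.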

edited Aug 12 '15 at 21:11

answered Aug 12 '15 at 21:04



because `rdata` and `df` are basically the same (one is a matrix, the other a data frame) – Ben Bolker Aug 12 '15 at 21:12
