1

Is there a better way to go through the rows of a data frame and impute NA values? I've put together a for loop that does the job, replacing each NA with its row's mean, but I'm wondering whether there is an approach that avoids the loop, perhaps a built-in R function?

# 1. Create data frame with some NA values. 

rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)  
df2 <- df

# 2. Run for loop to replace NAs with that row's mean.

for(i in 1:3){                  # for every row
  x <- as.numeric(df[i,])       # extract that row into a numeric vector
  y <- is.na(x)                 # logical vector marking the NAs
  z <- !is.na(x)                # logical vector marking the non-NAs
  result <- mean(x[z])          # mean of the row's non-NA values
  df2[i,y] <- result            # replace the NAs in that row
}

# 3. Show output with imputed row mean values.

print(df)  # before
print(df2) # after 
Bridgbro
  • 1
    you should always use `set.seed` when you provide data with random number generation – mlegge Aug 12 '15 at 20:52
  • 2
    @akrun, nice find. It seems the answer there is exactly the same as mine. Oh well, great minds think alike I guess :) – David Arenburg Aug 12 '15 at 21:09
  • @akrun imo, this question is not that identical... no answer was accepted on the other question by the OP. ;) I do think it helps others learn by seeing different ways of approaching and asking a related question, especially in R. The answer explanations and the structure of this question, I believe, have some value. – Bridgbro Aug 13 '15 at 00:33
  • 1
    Ok, then it is reopened. – akrun Aug 13 '15 at 00:36

3 Answers

6

Here's a possible vectorized approach (without any loop):

indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]

Some explanation

Calling which with arr.ind = TRUE returns a two-column matrix holding the row and column index of every NA. Indexing df with that matrix selects exactly those cells, and indexing the row means by the "row" column alone yields the matching replacement value for each of them.
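To make the indexing concrete, here is a minimal sketch on a small made-up data frame. The names `toy` and `row_means` are illustrative assumptions, not part of the question's data.

```r
# Hypothetical toy data: 2 rows, 3 columns, two missing values
toy <- data.frame(a = c(1, NA), b = c(3, 4), c = c(NA, 8))

# Two-column matrix ("row", "col") locating every NA
indx <- which(is.na(toy), arr.ind = TRUE)
print(indx)

# Row means ignoring NAs: row 1 -> (1 + 3) / 2, row 2 -> (4 + 8) / 2
row_means <- rowMeans(toy, na.rm = TRUE)

# For each NA, look up the mean of its row and assign by matrix index
toy[indx] <- row_means[indx[, "row"]]
print(toy)
```

Each NA ends up replaced by the mean of the non-missing values in its own row.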

David Arenburg
3

One possibility is impute from Hmisc, which lets you choose any function for the imputation:

library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))

Alternatively, you can hide the loop in an apply call:

t(apply(df2, 1, function(x) {
    mu <- mean(x, na.rm = TRUE)  # row mean, ignoring NAs
    x[is.na(x)] <- mu            # fill the row's NAs with that mean
    x
}))
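Note that apply returns a matrix, so the result above is no longer a data frame. A minimal sketch of wrapping the same idea so that a data frame comes back, assuming all columns are numeric (`impute_row_mean` and `toy` are hypothetical names introduced for illustration):

```r
# Hypothetical helper: fill each row's NAs with that row's mean
impute_row_mean <- function(d) {
  m <- t(apply(as.matrix(d), 1, function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  }))
  # apply() produced a matrix; rebuild a data frame with the original names
  out <- as.data.frame(m)
  names(out) <- names(d)
  out
}

# Made-up example data, not the question's df
toy <- data.frame(a = c(1, NA), b = c(3, 4), c = c(NA, 8))
res <- impute_row_mean(toy)
print(res)
```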
Rorschach
3

Data:

set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)

This is a little trickier than I'd like -- it relies on the column-major ordering of matrices in R, as well as on the row-means vector being recycled to the full length of the matrix. I tried to come up with a sweep() solution but haven't managed one so far.

rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
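The recycling argument can be checked on a small made-up example. This is only a sketch; `toy` and `recycled` are illustrative names that do not appear in the answer.

```r
# Hypothetical 2 x 3 data frame with two missing values
toy <- data.frame(a = c(1, NA), b = c(3, 4), c = c(NA, 8))
rmeans <- rowMeans(toy, na.rm = TRUE)

# Matrices are stored column by column, so recycling a vector of length
# nrow repeats it once per column: cell [i, j] is paired with rmeans[i]
recycled <- matrix(rep(rmeans, times = ncol(toy)), nrow = nrow(toy))
print(recycled)

# Same replacement as in the answer, applied to the toy data
toy[] <- ifelse(is.na(toy), rmeans, as.matrix(toy))
print(toy)
```

Every column of `recycled` is a copy of the row means, which is why the positions line up. A vector of column means would not line up this way.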
Ben Bolker