Since, per your comment, the columns named in num_var do not start at the first column of the data frame and are not contiguous, you need to loop over the column names rather than the column indices:
# simple example with just four columns
allProspect.tst <- data.frame(one=c(1:3,8), two=c(NA,4:6), three=1:4, four= c(5,NA,7, 8))
# want to replace NAs in columns "two" and "four" with values 5 and 7, respectively
num_var <- c("two","four")
median.to.replace <- c(5, 7)
# let's see the data before replacement
print(allProspect.tst)
##   one two three four
## 1   1  NA     1    5
## 2   2   4     2   NA
## 3   3   5     3    7
## 4   8   6     4    8
# just loop over the collection of column names (not indices)
for (name_col in num_var) {
  na_rows <- is.na(allProspect.tst[, name_col])
  # the key is to pick the matching element of median.to.replace:
  # which() gives the position in num_var whose value equals name_col
  allProspect.tst[na_rows, name_col] <- median.to.replace[which(num_var == name_col)]
}
# now let's see the replaced data
print(allProspect.tst)
##   one two three four
## 1   1   5     1    5
## 2   2   4     2    7
## 3   3   5     3    7
## 4   8   6     4    8
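If you would rather not hard-code the replacement values, here is a minimal sketch of one way to compute them and look them up by name instead of with which(). It assumes the replacement for each column should be that column's own median with NAs ignored, which happens to give 5 and 7 for this toy data; the rest is the same loop as above.

```r
# Same toy data as above
allProspect.tst <- data.frame(one = c(1:3, 8), two = c(NA, 4:6),
                              three = 1:4, four = c(5, NA, 7, 8))
num_var <- c("two", "four")

# Assumption: each replacement value is the column's median, ignoring NAs.
# sapply() over the selected columns returns a vector named by column.
median.to.replace <- sapply(allProspect.tst[num_var], median, na.rm = TRUE)

for (name_col in num_var) {
  na_rows <- is.na(allProspect.tst[[name_col]])
  # look the replacement up by column name rather than by position
  allProspect.tst[na_rows, name_col] <- median.to.replace[[name_col]]
}

print(allProspect.tst)
```

Because the vector is named, the lookup no longer depends on num_var and median.to.replace being kept in the same order.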
Update: making it more efficient
There are many ways to make the replacement operation more efficient for a large number of columns, but the most basic uses the *apply family of functions from the R base package (look here for an excellent overview). The updated code is as follows:
replace.with.median <- function(col, median.val, df) {
  na_rows <- is.na(df[, col])
  df[na_rows, col] <- median.val
  return(df[, col])
}
allProspect.tst[, num_var] <- mapply(replace.with.median, num_var, median.to.replace,
                                     MoreArgs = list(df = allProspect.tst))
print(allProspect.tst)
##   one two three four
## 1   1   5     1    5
## 2   2   4     2    7
## 3   3   5     3    7
## 4   8   6     4    8
Notes:
- The body of the original for loop is encapsulated in the function replace.with.median. The input arguments are:
  - col: a column name in which to find NAs to replace
  - median.val: the corresponding replacement value from median.to.replace
  - df: the data frame containing the data
  This function returns the col column from df with its NAs replaced by median.val.
- Use mapply, which according to the link above is:
  "For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc."
  Here, we want to apply the function replace.with.median over the two vectors num_var and median.to.replace in "lock-step" with each other. In addition, we provide the data frame allProspect.tst to replace.with.median through the MoreArgs argument of mapply.
- What gets returned from mapply is the collection of column vectors that have their NAs replaced. We then replace the corresponding columns of allProspect.tst with these.
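To make the last note concrete, here is a small sketch that inspects what mapply actually hands back for this example. With its default simplification, equal-length results are combined into a matrix with one column per name in num_var; passing SIMPLIFY = FALSE keeps a list of column vectors instead, which may be preferable if the columns have different types. The names res and res_list are just illustrative.

```r
allProspect.tst <- data.frame(one = c(1:3, 8), two = c(NA, 4:6),
                              three = 1:4, four = c(5, NA, 7, 8))
num_var <- c("two", "four")
median.to.replace <- c(5, 7)

replace.with.median <- function(col, median.val, df) {
  na_rows <- is.na(df[, col])
  df[na_rows, col] <- median.val
  return(df[, col])
}

# Default behaviour: the two length-4 results are simplified into a matrix
res <- mapply(replace.with.median, num_var, median.to.replace,
              MoreArgs = list(df = allProspect.tst))
print(is.matrix(res))
print(colnames(res))

# SIMPLIFY = FALSE returns a named list of column vectors instead
res_list <- mapply(replace.with.median, num_var, median.to.replace,
                   MoreArgs = list(df = allProspect.tst), SIMPLIFY = FALSE)

# Either form can be assigned back to the selected columns
allProspect.tst[num_var] <- res_list
print(allProspect.tst)
```

For purely numeric columns like these the matrix form is fine; the list form simply avoids coercing everything to a single common type.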
Hope this helps.