1

Trying to recode NAs in a very large dataframe with lot's of columns. I've stored the column names in a character vector (num_var) and replacement values for the different columns in a named vector (median.to.replace). In each column the NAs should be replaced with the correct value from median.to.replace.

No problem to run the code manually by using the code inside the seq_along loop and specify each column name manually

However when I try this simple code all the NAs don't get recoded and some NAs get replaced by an incorrect value??

for (name_col in seq_along(num_var))
{
  na_rows <- is.na(allProspect.tst[,name_col]) 
  allProspect.tst[na_rows,name_col] <- median.to.replace[name_col]

}

Anyone has a pointer to what's wrong? Trying to use fast and memory efficient approach to this

wabe
  • 93
  • 5

2 Answers2

1

This will work a lot faster if you use a data.table rather than data.frame. Here I created a random data set with missing values from the mtcars dataset then used a lookup table to replace those missing values.

library(data.table)
set.seed(44)
f_dowle<-function(DT,value=-1,col) { #copied and edited this function from elsewhere
  set(DT,which(is.na(DT[[col]])),col,value)
}

data(mtcars)

setDT(mtcars)

for(i in colnames(mtcars)){
  rand_na<-sample(1:nrow(mtcars),3)
  mtcars[rand_na,eval(as.name(i)):=NA]

}
head(mtcars) #showing random missing values

        mpg cyl disp  hp drat    wt  qsec vs am gear carb
1: 21.0  NA  160  NA 3.90 2.620 16.46  0  1    4    4
2: 21.0   6   NA 110 3.90    NA 17.02  0  1    4    4
3: 22.8   4  108  NA 3.85 2.320 18.61  1  1    4    1
4: 21.4   6   NA 110 3.08 3.215 19.44  1  0    3    1
5: 18.7  NA  360 175   NA 3.440 17.02  0  0    3    2
6: 18.1   6  225 105 2.76    NA 20.22  1  0    3    1

lkp_dt<-data.table(column=colnames(mtcars),value=1:11)
for(i in colnames(mtcars)){
  value=lkp_dt[column==i,value]
  f_dowle(mtcars,value=value,col=i)

}

head(mtcars) #missing values replaced

    mpg cyl disp  hp drat    wt  qsec vs am gear carb
1: 21.0   2  160   4 3.90 2.620 16.46  0  1    4    4
2: 21.0   6    3 110 3.90 6.000 17.02  0  1    4    4
3: 22.8   4  108   4 3.85 2.320 18.61  1  1    4    1
4: 21.4   6    3 110 3.08 3.215 19.44  1  0    3    1
5: 18.7   2  360 175 5.00 3.440 17.02  0  0    3    2
6: 18.1   6  225 105 2.76 6.000 20.22  1  0    3    1
Jason
  • 1,559
  • 1
  • 9
  • 14
  • Nice example! My first column to recode does not start at "1" however, and the column numbers are not continuous. Don't quite understand how to adapt your example to this situation. For example, I would like to recode column "3", "7", "22", "97" with median.to.replace[1,2,3,4] In reality many more columns – wabe Jul 21 '16 at 15:29
  • @wabe to answer the columns to include - you can manually write a list or check for columns that have missing values. As for the lookup table, this can be anything you'd like. If you want to recode with median values of those columns, you can either put that in a lookup table or calc directly from the table itself using the function. – Jason Jul 21 '16 at 15:36
1

According to your comment that the vector num_var does not start at the first column of the data frame and is not contiguous, then you need this

# simple example with just four columns
allProspect.tst <- data.frame(one=c(1:3,8), two=c(NA,4:6), three=1:4, four= c(5,NA,7, 8))
# want to replace NAs in columns "two" and "four" with values 5 and 7, respectively
num_var <- c("two","four")
median.to.replace <- c(5, 7)
# let's see the data before replacement
print(allProspect.tst)
##  one two three four
##1   1  NA     1    5
##2   2   4     2   NA
##3   3   5     3    7
##4   8   6     4    8

# just loop over the collection of column names (not indices)
for (name_col in num_var) {
  na_rows <- is.na(allProspect.tst[,name_col])
  # key is to get the corresponding element in median.to.replace 
  # using which() index in num_var has value equal name_col
  allProspect.tst[na_rows,name_col] <- median.to.replace[which(num_var==name_col)]
}
# now let's see the replaced data
print(allProspect.tst)
##  one two three four
##1   1   5     1    5
##2   2   4     2    7
##3   3   5     3    7
##4   8   6     4    8

Update: making it more efficient

There are many ways to make the replacement operation more efficient for a large number of columns, but the most basic uses the *apply family of functions, look here for an excellent overview, from the R base package. The updated code is as follows:

replace.with.median <- function(col, median.val, df) {
  na_rows <- is.na(df[, col])
  df[na_rows, col] <- median.val  
  return(df[, col])
}
allProspect.tst[, num_var] <- mapply(replace.with.median, num_var, median.to.replace, 
                                     MoreArgs=list(df=allProspect.tst))
print(allProspect.tst)
##  one two three four
##1   1   5     1    5
##2   2   4     2    7
##3   3   5     3    7
##4   8   6     4    8

Notes:

  1. The body of the original for loop is encapsulated in the function replace.with.median. The input arguments are:

    • col: a column name to find NAs to replace
    • median.val: the corresponding replacement value from median.to.replace
    • df: the data frame containing the data

    This function returns the col column from df whose NAs are replaced with median.val.

  2. Use mapply, which according to the link above:

    For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc.,

    Here, we want to apply the function replace.with.median over the two vectors num_var and median.to.replace in "lock-step" to each other. In addition, we provide the data frame allProspect.tst to replace.with.median through the MoreArgs argument of mapply.

  3. What gets returned from mapply is the collection of column vectors that have their NAs replaced. We then replace the corresponding columns of allProspect.tst with these.

Hope this helps.

Community
  • 1
  • 1
aichao
  • 7,375
  • 3
  • 16
  • 18
  • Very helpful! I modified my code and used the correct loop construct like you suggested. Ran quite efficient, just a few seconds to recode around 9 mil. rows and 40 columns. Found "mapply" myself but couldn't figure out how to apply it on my problem. Your example really helped me understand how to apply it. – wabe Jul 22 '16 at 05:52