
I need to build a function that replaces missing values with the mean for continuous/integer variables and with the mode for categorical variables.

The data come from the credit screening dataset:

X <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data", header = FALSE, na.strings = '?')

The first column of the dataset is a factor, the second and third columns are numeric, and so on.

I built a mode function:

mode_function <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

It works as intended.

The overall function that I am using on the dataset is:

broken <- function(data){
  for(i in 1:ncol(data)){
    if(is.factor(data[,i])){
      data[is.na(data[,i]),i] <- mode_function(data[,i])
    }
    else{
      data[is.na(data[,i]),i] <- mean(data[,i], na.rm = TRUE)
    }
  }
  return(data)
}

Problem: when I run this function, nothing changes in my dataset. I still have the same number of missing values as I did before the function was run.

Run outside the function, this line works as intended, and so does the line that handles the mean:

data[is.na(data[,i]),i] <- mode_function(data[,i])

But once I use my function to perform the exact same operations, nothing happens.

  • It is easier to help if you give a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and an expected output. – Ronak Shah Sep 28 '17 at 02:22

1 Answer

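The most likely reason for "nothing happening" is that the result was never assigned to an R name/symbol. The function modifies a local copy of its argument and returns that copy; the object you passed in is left unchanged. Perhaps try this: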

 maybe_res <- broken(data)

Check this:

> sapply(X, function(x) sum(is.na(x)))
 V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 
 12  12   0   6   6   9   9   0   0   0   0   0   0  13   0   0 
> sapply( broken(X), function(x) sum(is.na(x)))
 V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
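The same behaviour can be reproduced without downloading anything. The sketch below restates the two functions from the question and runs them on a small made-up data frame (`toy` and its values are hypothetical, chosen only for illustration). It shows that calling the function without assigning the result leaves the input untouched, while assigning it gives the imputed copy.

```r
# Restated from the question so this block runs on its own
mode_function <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

broken <- function(data) {
  for (i in 1:ncol(data)) {
    if (is.factor(data[, i])) {
      data[is.na(data[, i]), i] <- mode_function(data[, i])
    } else {
      data[is.na(data[, i]), i] <- mean(data[, i], na.rm = TRUE)
    }
  }
  return(data)
}

# Hypothetical toy data: one factor column and one numeric column
toy <- data.frame(
  grp = factor(c("a", "b", NA, "a")),
  val = c(1, NA, 3, 5)
)

broken(toy)            # result is printed and then discarded
sum(is.na(toy))        # toy still has its 2 missing values

fixed <- broken(toy)   # keep the returned copy
sum(is.na(fixed))      # no missing values left in fixed
```

One caveat if you rerun the question's code today: since R 4.0.0, `read.csv` no longer converts strings to factors by default, so the `is.factor` branch only applies to text columns if you pass `stringsAsFactors = TRUE`.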

I should warn you that mode functions are notorious for delivering answers that may not be what is desired.
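One concrete case, as a sketch: `unique()` keeps `NA` as a value, so when `NA` is the most frequent entry, the question's `mode_function` returns `NA` and the imputation changes nothing for that column. The variant below, `mode_no_na`, is a hypothetical helper (not part of the question) that drops missing values before counting. Ties are still resolved in favour of the value that appears first, because `which.max` returns the first maximum.

```r
# Restated from the question
mode_function <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical NA-aware variant: count only the observed values
mode_no_na <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up example in which NA occurs more often than any level
x <- factor(c("a", NA, NA, NA, "b", "b"))

mode_function(x)   # returns NA, since NA is the most frequent entry
mode_no_na(x)      # returns the level "b"
```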

IRTFM