I'm trying to write a function that replaces missing values in the numeric columns of a data frame with the median of that column. It also needs to replace missing values in the character columns with the most frequent value of that column.

It needs to be accomplished without the use of any packages.

The data looks like this:

  ID GLUC TGL HDL LDL  HRT MAMM SMOKE
1  A   88  NA  32  99    Y <NA>  ever
2  B   NA 150  60  NA <NA>   no never
3  C  110  NA  NA 120    N <NA>  <NA>
4  D   NA 200  65 165 <NA>  yes never
5  E   90 210  NA 150    Y <NA> never
6  F   88  NA  32 210 <NA>  yes  ever
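
For anyone who wants to reproduce this, the table above could be rebuilt in base R roughly as follows. The object name `patient` (borrowed from a later comment) and the choice of character rather than factor columns are assumptions, since the original object was not posted.

```r
# Hand-typed copy of the printed table; column types are an assumption.
patient <- data.frame(
  ID    = c("A", "B", "C", "D", "E", "F"),
  GLUC  = c(88, NA, 110, NA, 90, 88),
  TGL   = c(NA, 150, NA, 200, 210, NA),
  HDL   = c(32, 60, NA, 65, NA, 32),
  LDL   = c(99, NA, 120, 165, 150, 210),
  HRT   = c("Y", NA, "N", NA, "Y", NA),
  MAMM  = c(NA, "no", NA, "yes", NA, "yes"),
  SMOKE = c("ever", "never", NA, "never", "never", "ever"),
  stringsAsFactors = FALSE  # keep text columns as character
)
```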

EDIT

This is what I have so far and I'm not sure if I'm even close ...

impute<- function(dat, varlist) {
  if (is.numeric(varlist)) {
    res <- median(varlist, na.rm = TRUE)
  }
  else {
    res <- dat[which.max(varlist)]
  }
  na.index <- which(is.na(varlist))
  dat[na.index] <- res
  dat
}
SecretAgentMan
fiverings84
  • Sorry about the improper tag. I just made an edit to the original post to provide my very poor attempt at figuring this out. I'm completely lost. – fiverings84 Dec 06 '20 at 03:43
  • I have two side comments. First, it's a bit strange to do data imputation without the support of all the imputation packages out there. It sounds a bit like when professors teach C++ without the standard library ^^ Second, if I remember correctly a seminar given a few years ago, imputing data with the mean or median can destroy the correlation structure in your data (so maybe you don't really want to do that, depending on your use case) :) – WaterFox Dec 06 '20 at 03:49

1 Answer

You could write a function like this:

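impute <- function(data, varlist) {
  data[varlist] <- lapply(data[varlist], function(x) {
    if (is.numeric(x)) {
      # Numeric column: fill NAs with the median of the observed values
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else {
      # Any other column: fill NAs with the most frequent observed value
      x[is.na(x)] <- Mode(na.omit(x))
    }
    return(x)
  })
  # Return only the imputed columns
  return(data[varlist])
}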

impute(df, c('GLUC', 'HRT'))

#  GLUC HRT
#1   88   Y
#2   89   Y
#3  110   N
#4   89   Y
#5   90   Y
#6   88   Y

where Mode is a function taken from here:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
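
To see why this returns the most frequent value, it may help to run the steps one at a time. The sketch below uses the three non-missing HRT values from the question; the variable names `idx` and `counts` are only for illustration.

```r
x <- c("Y", "N", "Y")          # observed HRT values, NAs already dropped

ux     <- unique(x)            # distinct values in order of first appearance
idx    <- match(x, ux)         # position of each element of x within ux
counts <- tabulate(idx)        # how often each distinct value occurs
result <- ux[which.max(counts)]  # value with the highest count

# which.max() returns the first maximum, so on a tie the value that
# appears first in x is the one returned.
tie <- c("b", "a", "a", "b")
tie_ux <- unique(tie)
tie_result <- tie_ux[which.max(tabulate(match(tie, tie_ux)))]
```

Here `result` is "Y", and in the tied case `tie_result` is "b" because "b" appears first.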
Ronak Shah
  • That almost works. However, when I input impute(dat = patient, varlist = "HRT"), I need the output to be just the information for HRT. Right now, it does it for HRT but the output also includes all the other variables (without the imputation done to them). – fiverings84 Dec 06 '20 at 03:56
  • Maybe just return data[varlist]? – WaterFox Dec 06 '20 at 03:58
  • Amazing. Thanks so much! I'm particularly fascinated with how the Mode part was used. Love learning new things. – fiverings84 Dec 06 '20 at 04:00
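
If both behaviours discussed in these comments are wanted (only the imputed columns, or the whole data frame), one possible sketch is to add a switch. The function name `impute_cols` and the argument `only_selected` are hypothetical and not part of the answer; the small `patient` data frame is typed in from two columns of the question's data.

```r
# Helper from the answer: most frequent value (first one wins on ties).
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical variant of impute(); `only_selected` is an illustrative
# argument that chooses between the two return shapes.
impute_cols <- function(data, varlist, only_selected = TRUE) {
  data[varlist] <- lapply(data[varlist], function(x) {
    if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else {
      x[is.na(x)] <- Mode(na.omit(x))
    }
    x
  })
  if (only_selected) data[varlist] else data
}

# Two columns from the question's data, typed in by hand.
patient <- data.frame(
  GLUC = c(88, NA, 110, NA, 90, 88),
  HRT  = c("Y", NA, "N", NA, "Y", NA),
  stringsAsFactors = FALSE
)

imputed_hrt <- impute_cols(patient, "HRT")                   # HRT column only
imputed_all <- impute_cols(patient, c("GLUC", "HRT"), FALSE) # whole data frame
```

With `only_selected = FALSE`, columns that are not listed in `varlist` are returned unchanged, missing values included.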