1

I'm trying to write a function to replace missing values in columns with median, and that this works for both factors/characters and numerical values.

library(dplyr)
test = data.frame(a=1:6,b=c("a","b",NA,NA,NA,"c"),c=c(1,1,1,1,2,NA),d=c("a","a","c",NA,NA,"b"))

fun_rep_na = function(df){
  for(i in colnames(df)){
    j<-sym(i)
    df = df %>% mutate(!!j=if_else(is.na(!!j),median(!!j, na.rm=TRUE),!!j))
  }
}

I see that tidyr has a function called replace_na, but I'm not sure how to use this either. Anyway, a custom function is what I would like.

The code above gives me an error.

Helen
  • 533
  • 12
  • 37

2 Answers2

1

We can use mutate_if with median as median works only on numeric columns

test %>% 
   mutate_if(is.numeric, list(~ replace(., is.na(.), median(., na.rm = TRUE))))

If we want the value most repeated, then we may need Mode

Mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

The Mode function was first updated here

test %>% 
  mutate_all(list(~ replace(., is.na(.), Mode(.))))
#  a b c d
#1 1 a 1 a
#2 2 b 1 a
#3 3 a 1 c
#4 4 a 1 a
#5 5 a 2 a
#6 6 c 1 b
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Ah that looks beautiful. Could you explain what the ~ does and all the dots? – Helen Jun 19 '19 at 07:30
  • 1
    @Helen It is a tilda operator used to make the anonymous function call concise in tidyverse, otherwise, you may need `function(x) replace(x, is.na(x),` – akrun Jun 19 '19 at 07:31
  • What is the anonymous function and what is concise? Could you point me in the right direction where to read about this? – Helen Jun 19 '19 at 07:57
  • 1
    @Helen An anonymous function is one used in many languages (sometimes, known as `lambda` function). For e.g. with python it would be `lambda x: x * 5` or using R, `lapply(1:5, function(x) x: (x + 5))` If you have checked `map` and other functions, the conciseness is `map(1:5, ~ .x:(.x + 5))` – akrun Jun 19 '19 at 07:59
1

I think you are looking for Mode and not median

Taking Mode function from here

Mode <- function(x) {
   ux <- unique(x)
   ux[which.max(tabulate(match(x, ux)))]
}

library(dplyr)

test %>%  mutate_all(~replace(., is.na(.), Mode(na.omit(.))))

#  a b c d
#1 1 a 1 a
#2 2 b 1 a
#3 3 a 1 c
#4 4 a 1 a
#5 5 a 2 a
#6 6 c 1 b
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213