1

I am working on a dataset with a few missing values marked as "?". I have to replace them with the most common value (mode) of their column, and I want to write code that does this for the whole dataset at once.

This is what I have so far:

df <- read.csv("mushroom.txt", na.strings = "?",header=FALSE)

Now I am trying to replace all the NA values in the data frame with the mode of their column. Please help.

pnuts
Di sha
    I think there are a lot of similar questions. Start by taking a look [here](http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-r), or at least provide a minimal code example. – SabDeM Jun 24 '15 at 18:34

5 Answers

1
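replaceQuestions <- function(vector) {
  # Treat both "?" and NA as missing, so this works with or without na.strings = "?"
  isMissing <- is.na(vector) | vector == "?"

  # Most frequent value among the non-missing entries
  mostCommon <- names(sort(table(vector[!isMissing]), decreasing = TRUE))[1]
  vector[isMissing] <- mostCommon
  vector
}

# apply() returns a character matrix, so convert the result back to a data frame
df <- as.data.frame(apply(df, 2, replaceQuestions))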

Not reproducible so I'm not sure if this is what you were looking for, but this solves the problem as I've interpreted it.

Frank P.
1

Since you want to replace by the mode of a column, you want to operate column-wise via apply and use is.na to identify the entries that you want to replace.

apply(df, 2, function(x){ 
    x[is.na(x)] <- names(which.max(table(x)))
    return(x) })

Note that apply returns a matrix, so if you want a data.frame you need to convert the result with as.data.frame.
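As a sketch of that conversion, assuming a small made-up data frame (toy and fill_mode are hypothetical names, not the asker's data): wrapping apply in as.data.frame restores the data frame, while assigning through toy[] <- lapply(...) avoids the matrix step altogether.

```r
# Hypothetical toy data; NA marks the missing entries
toy <- data.frame(cap  = c("x", "b", NA, "x"),
                  odor = c("a", NA, "a", "l"),
                  stringsAsFactors = FALSE)

# Replace NA in one column with that column's most frequent value
fill_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}

# apply() goes through a character matrix, so wrap it to get a data.frame back
via_apply <- as.data.frame(apply(toy, 2, fill_mode), stringsAsFactors = FALSE)

# Alternative: assigning into toy[] keeps the data.frame and works column by column
via_lapply <- toy
via_lapply[] <- lapply(toy, fill_mode)
```

Because names() returns text, both variants suit categorical columns; a numeric column would come back as character and need converting.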

cr1msonB1ade
  • 1,716
  • 9
  • 14
1

Note that na.strings = "?" in your read.csv call turns the "?" markers into NA. If you read the file without it, so that the "?" stay in the data as text, I think this could help:

apply(df,2,function(x) gsub("\\?",names(sort(-table(x,exclude="?")))[1],x))

The exclude part avoids selecting "?" itself, should it be the most frequent value. The \\ escapes the ? character, which is special in regular expressions, so that gsub matches it literally.
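To illustrate the escaping on a made-up vector (x is hypothetical): the pattern can either escape the ? as above, or pass fixed = TRUE so that gsub treats it as a literal string.

```r
x <- c("a", "?", "b", "a")   # hypothetical column with "?" as the missing marker

# Most frequent value, ignoring the "?" marker itself
most_common <- names(sort(-table(x, exclude = "?")))[1]

escaped <- gsub("\\?", most_common, x)               # regex with ? escaped
literal <- gsub("?", most_common, x, fixed = TRUE)   # literal match, no escaping needed
```

Both calls should give the same result here; fixed = TRUE is just an alternative to the backslashes.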

====== EDIT TO ADD ======

gsub converts everything to text, so if your columns are numeric you'll need to convert them back:

a<-apply(df,2,function(x) gsub("\\?",names(sort(-table(x,exclude="?")))[1],x))
new_df<-as.data.frame(apply(a,2,as.numeric))

The last line produces a new data frame.

PavoDive
0

Or:

apply(df, 2, function(x) {
  x[is.na(x)] <- Mode(x[complete.cases(x)])
  x})

This uses the well-known Mode function from the SO question "Is there a built-in function for finding the mode?":

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
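A quick check of what Mode returns, using made-up vectors (nums and chars are illustrative only). Unlike an approach based on names(table(x)), it keeps the type of its input:

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

nums  <- c(3, 7, 7, NA, 2)     # hypothetical numeric column
chars <- c("e", "p", "e", NA)  # hypothetical character column

# Compute the mode on the non-missing entries only
num_mode <- Mode(nums[!is.na(nums)])    # stays numeric
chr_mode <- Mode(chars[!is.na(chars)])  # stays character

nums[is.na(nums)]   <- num_mode
chars[is.na(chars)] <- chr_mode
```

When several values tie for the highest count, which.max picks the one that appears first in the vector.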
Community
Pierre L
0

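Use a loop over the columns. Base R's mode() returns the storage type rather than the most frequent value, so the mode is computed explicitly:

for (i in seq_len(ncol(dataframename))) {
  col <- dataframename[[i]]
  vals <- col[!is.na(col)]
  # First value with the highest count among the non-missing entries
  col[is.na(col)] <- vals[which.max(tabulate(match(vals, vals)))]
  dataframename[[i]] <- col
}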
toy
Ajay Ohri