2

I have the following dummy dataframe:

col1 = c("aa", NA, NA, NA, NA, NA, NA
        , "cc", "cc", "cc", "cc", "cc", "cc", "cc", "cc", "cc"
        , "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa")
col2 = c("aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa"
         , NA, NA, NA, NA, NA, NA, NA, NA, NA
         , "bb", "bb", "bb", "bb", "bb", "bb", "bb", "bb", "bb")
col3 = c("aa", "bb", "bb"
         , NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
         , NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
col4 = c(NA, NA, NA, 4:27)
col5 = c(28:51, NA, NA, NA)

# Construct the data frame with NAs in categorical and numeric columns
df = data.frame("col1" = col1, "col2" = col2, "col3" = col3
                , "col4" = col4, "col5" = col5, stringsAsFactors = FALSE)

I would like to understand how to write a function to impute only categorical values i.e. col1, col2, col3 by using the simple rules:

  1. impute categorical NA column values with the most frequent values in that column
  2. in case of ties choose the alphabetically first value i.e. aa has preference over bb (in the case for col2)

Could anyone please assist in writing a function which takes in df as an input and returns the imputed dataframe for categorical values only. col4, col5 should remain unchanged (They have NAs but are numeric so should be ignored).

Clarification For this example:

  1. col1 NAs should be imputed to be "aa"
  2. col2 NAs should be imputed to be "aa" (by alphabetic preference in ties)
  3. col3 NAs should be imputed to be "bb"

Thanks

user4687531
  • 1,021
  • 15
  • 30
  • 1
    what's your effort so far? – G. Cocca Apr 02 '16 at 19:31
  • @G.Cocca - so far I have the following: ```tt <- table(df$col1) names(tt[which.max(tt)])``` Which gets the max values. Not sure how to classify cols efficiently as categorical and then replace the NAs though. This was based on http://stackoverflow.com/questions/18433647/how-to-get-the-most-frequent-level-of-a-categorical-variable-in-r – user4687531 Apr 02 '16 at 19:35

2 Answers2

4

We could create an index for non-numeric columns

i1 <- !sapply(df, is.numeric)

Create a function for Mode

Mode <- function(x) { 
      ux <- sort(unique(x))
      ux[which.max(tabulate(match(x, ux)))] 
}

and replace the NAs in character columns with the most frequent value

df[i1] <- lapply(df[i1], function(x)
              replace(x, is.na(x), Mode(x[!is.na(x)])))
akrun
  • 874,273
  • 37
  • 540
  • 662
2

You can use this solution:

df[, sapply(df, function(x) !is.numeric(x))] <- apply(df[, sapply(df, function(x) !is.numeric(x))], 2, function(x) {x[is.na(x)] <- names(sort(table(x), decreasing = TRUE)[1]); x})
   col1 col2 col3 col4 col5
1    aa   aa   aa   NA   28
2    aa   aa   bb   NA   29
3    aa   aa   bb   NA   30
4    aa   aa   bb    4   31
5    aa   aa   bb    5   32
6    aa   aa   bb    6   33
7    aa   aa   bb    7   34
8    cc   aa   bb    8   35
9    cc   aa   bb    9   36
10   cc   aa   bb   10   37
11   cc   aa   bb   11   38
12   cc   aa   bb   12   39
13   cc   aa   bb   13   40
14   cc   aa   bb   14   41
15   cc   aa   bb   15   42
16   cc   aa   bb   16   43
17   aa   aa   bb   17   44
18   aa   aa   bb   18   45
19   aa   bb   bb   19   46
20   aa   bb   bb   20   47
21   aa   bb   bb   21   48
22   aa   bb   bb   22   49
23   aa   bb   bb   23   50
24   aa   bb   bb   24   51
25   aa   bb   bb   25   NA
26   aa   bb   bb   26   NA
27   aa   bb   bb   27   NA
Gopala
  • 10,363
  • 7
  • 45
  • 77
  • thanks. Just for my understanding, which part checks for categorical vs numeric? – user4687531 Apr 02 '16 at 19:58
  • Oops...sorry, I did not include that. I will edit. I just took first 3 columns. – Gopala Apr 02 '16 at 19:59
  • Edited: `sapply(df, function(x) !is.numeric(x))` checks for non-numeric columns and allows subsetting just those columns. If you explicitly want character columns, you can modify the condition from `!is.numeric()` to `is.character()`. – Gopala Apr 02 '16 at 20:03