Impute most frequent categorical value in all columns in data frame

Question

I have the following dummy dataframe:

col1 = c("aa", NA, NA, NA, NA, NA, NA
        , "cc", "cc", "cc", "cc", "cc", "cc", "cc", "cc", "cc"
        , "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa")
col2 = c("aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa"
         , NA, NA, NA, NA, NA, NA, NA, NA, NA
         , "bb", "bb", "bb", "bb", "bb", "bb", "bb", "bb", "bb")
col3 = c("aa", "bb", "bb"
         , NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
         , NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
col4 = c(NA, NA, NA, 4:27)
col5 = c(28:51, NA, NA, NA)

# Construct the data frame with NAs in categorical and numeric columns
df = data.frame("col1" = col1, "col2" = col2, "col3" = col3
                , "col4" = col4, "col5" = col5, stringsAsFactors = FALSE)

I would like to understand how to write a function to impute only categorical values i.e. col1, col2, col3 by using the simple rules:

impute categorical NA column values with the most frequent values in that column
in case of ties choose the alphabetically first value i.e. aa has preference over bb (in the case for col2)

Could anyone please assist in writing a function which takes in df as an input and returns the imputed dataframe for categorical values only. col4, col5 should remain unchanged (They have NAs but are numeric so should be ignored).

Clarification For this example:

col1 NAs should be imputed to be "aa"
col2 NAs should be imputed to be "aa" (by alphabetic preference in ties)
col3 NAs should be imputed to be "bb"

Thanks

@G.Cocca - so far I have the following: ```tt <- table(df$col1) names(tt[which.max(tt)])``` Which gets the max values. Not sure how to classify cols efficiently as categorical and then replace the NAs though. This was based on http://stackoverflow.com/questions/18433647/how-to-get-the-most-frequent-level-of-a-categorical-variable-in-r — user4687531, Apr 02 '16 at 19:35

akrun · Accepted Answer · 2016-04-02T19:59:18.747

4

We could create an index for non-numeric columns

i1 <- !sapply(df, is.numeric)

Create a function for Mode

Mode <- function(x) { 
      ux <- sort(unique(x))
      ux[which.max(tabulate(match(x, ux)))] 
}

and replace the NAs in character columns with the most frequent value

df[i1] <- lapply(df[i1], function(x)
              replace(x, is.na(x), Mode(x[!is.na(x)])))

edited Apr 02 '16 at 19:59

answered Apr 02 '16 at 19:38

akrun

874,273
37
540
662

Gopala · Answer 2 · 2016-04-02T20:02:41.273

2

You can use this solution:

df[, sapply(df, function(x) !is.numeric(x))] <- apply(df[, sapply(df, function(x) !is.numeric(x))], 2, function(x) {x[is.na(x)] <- names(sort(table(x), decreasing = TRUE)[1]); x})
   col1 col2 col3 col4 col5
1    aa   aa   aa   NA   28
2    aa   aa   bb   NA   29
3    aa   aa   bb   NA   30
4    aa   aa   bb    4   31
5    aa   aa   bb    5   32
6    aa   aa   bb    6   33
7    aa   aa   bb    7   34
8    cc   aa   bb    8   35
9    cc   aa   bb    9   36
10   cc   aa   bb   10   37
11   cc   aa   bb   11   38
12   cc   aa   bb   12   39
13   cc   aa   bb   13   40
14   cc   aa   bb   14   41
15   cc   aa   bb   15   42
16   cc   aa   bb   16   43
17   aa   aa   bb   17   44
18   aa   aa   bb   18   45
19   aa   bb   bb   19   46
20   aa   bb   bb   20   47
21   aa   bb   bb   21   48
22   aa   bb   bb   22   49
23   aa   bb   bb   23   50
24   aa   bb   bb   24   51
25   aa   bb   bb   25   NA
26   aa   bb   bb   26   NA
27   aa   bb   bb   27   NA

edited Apr 02 '16 at 20:02

answered Apr 02 '16 at 19:46

Gopala

10,363
7
45
77

thanks. Just for my understanding, which part checks for categorical vs numeric? – user4687531 Apr 02 '16 at 19:58
Oops...sorry, I did not include that. I will edit. I just took first 3 columns. – Gopala Apr 02 '16 at 19:59
Edited: `sapply(df, function(x) !is.numeric(x))` checks for non-numeric columns and allows subsetting just those columns. If you explicitly want character columns, you can modify the condition from `!is.numeric()` to `is.character()`. – Gopala Apr 02 '16 at 20:03

Impute most frequent categorical value in all columns in data frame

2 Answers2