I have the following dummy dataframe:
col1 = c("aa", NA, NA, NA, NA, NA, NA
, "cc", "cc", "cc", "cc", "cc", "cc", "cc", "cc", "cc"
, "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa")
col2 = c("aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa"
, NA, NA, NA, NA, NA, NA, NA, NA, NA
, "bb", "bb", "bb", "bb", "bb", "bb", "bb", "bb", "bb")
col3 = c("aa", "bb", "bb"
, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
col4 = c(NA, NA, NA, 4:27)
col5 = c(28:51, NA, NA, NA)
# Construct the data frame with NAs in categorical and numeric columns
df = data.frame("col1" = col1, "col2" = col2, "col3" = col3
, "col4" = col4, "col5" = col5, stringsAsFactors = FALSE)
I would like to understand how to write a function that imputes only the categorical columns, i.e. col1, col2, col3,
using one simple rule:
- impute categorical NA column values with the most frequent value in that column; in case of ties choose the alphabetically first value, i.e. aa has preference over bb (as is the case for col2)
Could anyone please assist in writing a function which takes df as input and returns the dataframe with only the categorical values imputed? col4 and col5
should remain unchanged (they have NAs but are numeric, so they should be ignored).
Clarification: for this example,
- col1 NAs should be imputed to be "aa"
- col2 NAs should be imputed to be "aa" (by alphabetic preference in ties)
- col3 NAs should be imputed to be "bb"
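To make the rule concrete, here is a minimal base-R sketch of the behaviour I am after, not a definitive implementation. The names most_frequent and impute_categorical and the small toy data frame are hypothetical, made up for illustration. It assumes "categorical" means character or factor columns, and that "alphabetically first" means the order R's table() sorts the values in, which follows the locale's collation.

```r
# Hypothetical helper: most frequent non-NA value of a vector.
# table() drops NAs by default and orders its names by sorted value,
# and which.max() returns the first maximum, so ties go to the
# alphabetically first value.
most_frequent <- function(x) {
  counts <- table(as.character(x))
  names(counts)[which.max(counts)]
}

# Hypothetical function: fill NAs in character/factor columns only.
impute_categorical <- function(df) {
  is_cat <- vapply(df, function(col) is.character(col) || is.factor(col),
                   logical(1))
  for (nm in names(df)[is_cat]) {
    missing <- is.na(df[[nm]])
    # Skip columns with nothing to fill, or with no observed value to fill from
    if (any(missing) && !all(missing)) {
      df[[nm]][missing] <- most_frequent(df[[nm]])
    }
  }
  df
}

# Toy data (assumed for illustration, not the df from the question)
toy <- data.frame(a = c("aa", NA, "bb", "bb", NA),   # "bb" is most frequent
                  b = c("bb", "aa", NA, NA, NA),     # tie between "aa" and "bb"
                  n = c(1, NA, 3, 4, 5),             # numeric, must stay untouched
                  stringsAsFactors = FALSE)
result <- impute_categorical(toy)
```

With the toy data, column a should be filled with "bb", column b with "aa" (the tie-break), and the NA in the numeric column n should be left alone. Calling impute_categorical(df) on the data frame above should work the same way.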
Thanks