I have a data frame in which some of the variables (columns) are factors, and some records have missing values (NA) in those columns.
My questions are:
1. What is the correct approach to replacing/imputing NAs in factor variables? E.g. for VarX with 4 levels {"A", "B", "C", "D"}, what would be the preferred value to replace NAs with: A, B, C or D? Maybe just 0? Or maybe the level that occurs most often among this variable's observations?
2. How do I implement such imputation, based on the answer to 1?
3. Once 1 and 2 are resolved, I'll use the following to create dummy variables for the factor variables:
library(dummies)  # provides dummy.data.frame
is.fact <- sapply(my_data, is.factor)
my_data.dummy_vars <- dummy.data.frame(my_data[, is.fact, drop = FALSE], sep = ".")
Afterwards, how do I replace all the factor variables in my_data with the dummy variables I've extracted into my_data.dummy_vars?
My use case is to calculate principal components afterwards, which requires all variables to be numeric (hence the dummy variables).
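To make the intended pipeline concrete, here is a minimal sketch on toy data. It assumes the most frequent level is used for imputation, which is only one of the options listed in question 1 and not necessarily the right one. The toy values, the helper impute_mode and the name my_data.numeric are hypothetical, and the dummy columns are built with base R rather than dummy.data.frame so the example runs without extra packages.

```r
# Toy data standing in for my_data (hypothetical values).
my_data <- data.frame(
  VarX = factor(c("A", "B", NA, "B", "D", "C")),
  VarY = factor(c("u", NA, "v", "v", "u", "v")),
  num  = c(1.5, 2.0, 0.5, 3.0, 2.5, 1.0)
)

# Hypothetical helper: replace NAs in a factor with its most frequent level.
impute_mode <- function(f) {
  mode_level <- names(which.max(table(f)))  # table() ignores NAs by default
  f[is.na(f)] <- mode_level
  f
}

is.fact <- sapply(my_data, is.factor)
my_data[is.fact] <- lapply(my_data[is.fact], impute_mode)

# Build one 0/1 indicator column per level, named "<variable>.<level>".
dummy_list <- lapply(names(my_data)[is.fact], function(v) {
  lv <- levels(my_data[[v]])
  m <- sapply(lv, function(l) as.integer(my_data[[v]] == l))
  colnames(m) <- paste(v, lv, sep = ".")
  m
})
my_data.dummy_vars <- as.data.frame(do.call(cbind, dummy_list))

# Swap the factor columns for their dummy columns.
my_data.numeric <- cbind(my_data[, !is.fact, drop = FALSE], my_data.dummy_vars)

# PCA on the all-numeric data frame.
pca <- prcomp(my_data.numeric, center = TRUE, scale. = FALSE)
summary(pca)
```

Because each factor contributes a full set of indicator columns that sum to 1, the dummy columns are collinear, so the last components should have variance close to zero. Another option for question 1 would be to keep NA as its own level (base R has addNA() for that) instead of imputing.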
Thanks