I am trying to replace the NA values in my columns with 'UNK' so that I can run a logistic regression.
Here is the code and its output, laid out step by step for context. (I have not included every column, but the same issue occurs with all of them.)
donors <- read_csv("donors.csv", col_types = "nnffnnnnnnnnffffffffff")
glimpse(donors)
Rows: 95,412
Columns: 22
$ age <dbl> 60, 46, NA, 70, 78, NA, 38, ~
$ numberChildren <dbl> NA, 1, NA, NA, 1, NA, 1, NA,~
$ incomeRating <fct> NA, 6, 3, 1, 3, NA, 4, 2, 3,~
Here, I singled out the factor columns to see them more clearly:
donors %>% keep( is.factor) %>% summary()
incomeRating wealthRating inHouseDonor
NA :21286 NA :44732 FALSE:88709
5 :15451 9 : 7585 TRUE : 6703
2 :13114 8 : 6793
4 :12732 7 : 6198
1 : 9022 6 : 5825
3 : 8558 5 : 5280
(Other):15249 (Other):18999
Now, I try to replace all of the NA values in the incomeRating column (and other columns) with 'UNK':
donors <- donors %>% mutate( incomeRating = as.character( incomeRating)) %>%
mutate( incomeRating = as.factor( ifelse( is.na( incomeRating), 'UNK', incomeRating)))
There is no error message, but when I print the table of proportions like so, the NA values are not replaced:
donors %>%
select(incomeRating) %>%
table() %>%
prop.table()
1 2 3 4 5
0.09455834 0.13744602 0.08969522 0.13344233 0.16193980
6 7 NA
0.08152014 0.07830252 0.22309563
Again, this happens with all columns. I believe R is treating NA as an actual value (a literal factor level "NA") rather than as a missing value, so is.na() never matches it. If this is the case, what is a solution? Thank you in advance.
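To make the suspicion above concrete, here is a minimal base R sketch of how it could be checked and worked around. It uses a small made-up vector in place of the real incomeRating column, and assumes the column holds the string "NA" as a factor level rather than a true missing value. That assumption fits the summary() output, which shows "NA" as an ordinary level rather than as "NA's", but it is not confirmed against the actual file.

```r
# Toy stand-in for donors$incomeRating (made-up values, not the real data).
# The "NA" entries are ordinary strings, so factor() keeps "NA" as a level.
incomeRating <- factor(c("NA", "6", "3", "1", "3", "NA", "4"))

# Diagnostic: is.na() finds no missing values, yet "NA" is among the levels.
n_missing    <- sum(is.na(incomeRating))
has_na_level <- "NA" %in% levels(incomeRating)

# Workaround: rename the literal "NA" level directly instead of using is.na().
levels(incomeRating)[levels(incomeRating) == "NA"] <- "UNK"

# Proportion table after the rename.
prop.table(table(incomeRating))
```

If this is what is happening in the real data, the same level rename should apply to each factor column in donors. With the tidyverse loaded, forcats::fct_recode(incomeRating, UNK = "NA") would be an equivalent way to write the rename inside mutate().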