I am trying to replace the NA values in columns with 'UNK' to be able to execute a logistic regression.

Here is my code with its output. I want to lay out each step I took for context. (Note that I have not included every column, but the same issue occurs with all of them.)

donors <- read_csv("donors.csv", col_types = "nnffnnnnnnnnffffffffff")

glimpse(donors) 
Rows: 95,412
Columns: 22
$ age                     <dbl> 60, 46, NA, 70, 78, NA, 38, ~
$ numberChildren          <dbl> NA, 1, NA, NA, 1, NA, 1, NA,~
$ incomeRating            <fct> NA, 6, 3, 1, 3, NA, 4, 2, 3,~

Here, I singled out the factor columns to view them more clearly:

donors %>% keep( is.factor) %>% summary()
  incomeRating    wealthRating   inHouseDonor 
 NA     :21286   NA     :44732   FALSE:88709  
 5      :15451   9      : 7585   TRUE : 6703  
 2      :13114   8      : 6793                
 4      :12732   7      : 6198                
 1      : 9022   6      : 5825                
 3      : 8558   5      : 5280                
 (Other):15249   (Other):18999    

Now, I try to replace all of the NA values in the incomeRating column (and other columns) with 'UNK':

donors <- donors %>% mutate( incomeRating = as.character( incomeRating)) %>%
  mutate( incomeRating = as.factor( ifelse( is.na( incomeRating), 'UNK', incomeRating)))

There is no error message, but when I compute the proportion table like so, the NA's have not been replaced:

donors%>%
  select(incomeRating) %>%
  table() %>%
  prop.table()
         1          2          3          4          5 
0.09455834 0.13744602 0.08969522 0.13344233 0.16193980 
         6          7         NA 
0.08152014 0.07830252 0.22309563 

Again, this happens with all columns. I believe R is treating the NA entries as actual values rather than missing values, so is.na() does not detect them. If that is the case, what is the solution? Thank you in advance.

LoveMYMAth
  • 1
    Try something like: `donors <- donors %>% mutate( incomeRating = ifelse( is.na( incomeRating), 'UNK', incomeRating))` (i.e., without converting to character and without the factor step). Also, you might want to look at `across()` inside `mutate()` for multiple columns – thehand0 Feb 09 '22 at 20:48
  • 1
    I'm guessing the factor labels are literally `"NA"`, i.e. the two letters `"N"` and `"A"` in a string. Compare `summary(factor(c(NA,1:5)))` and you'll see that actual `NA` values are listed at the end under `NA's` instead of at the start. Try `incomeRating == "NA"`. – thelatemail Feb 09 '22 at 21:03
  • 1
    Since this is an issue with data types, a [reproducible example](https://stackoverflow.com/q/5963269/5325862) of your data is needed to do more than guess – camille Feb 09 '22 at 21:10
  • Indeed, reproducible data would confirm it. I'm 99.94% certain they're not actual `NA`s though. Hint 2 would be that `table(factor(c(NA,1:5)))` doesn't show `NA` values by default but they are shown in your output. – thelatemail Feb 09 '22 at 21:18
  • Reproducible data can be found in the Chapter 5 download section on this website. The CSV file is called "donors": https://www.wiley.com/en-us/Practical+Machine+Learning+in+R-p-9781119591511#downloads-section – LoveMYMAth Feb 09 '22 at 21:20
  • @thelatemail was correct. I figured it out. Also, how would I be able to use the `mutate_across` function in a situation like this? – LoveMYMAth Feb 09 '22 at 21:23
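
Following up on the last comment: dplyr has no function named `mutate_across`; the usual idiom is `across()` inside `mutate()`. Below is a minimal sketch of the diagnosis and a multi-column fix, assuming (as the comments above suggest) that the factor levels contain the literal string "NA" rather than real missing values. The `toy` data frame and the `relabel_na()` helper are hypothetical stand-ins for illustration; they are not part of the original code or of `donors.csv`.

```r
library(dplyr)

# Hypothetical toy data standing in for donors.csv: the factor columns
# carry the literal string "NA" as a level (assumption from the comments).
toy <- tibble(
  age          = c(60, 46, NA, 70),
  incomeRating = factor(c("NA", "6", "3", "NA")),
  wealthRating = factor(c("9", "NA", "NA", "8"))
)

# Diagnosis: "NA" appears among the levels, yet is.na() finds nothing.
levels(toy$incomeRating)
sum(is.na(toy$incomeRating))  # 0

# Hypothetical helper: rename the literal "NA" level, leave the rest alone.
relabel_na <- function(f, label = "UNK") {
  levels(f)[levels(f) == "NA"] <- label
  f
}

# Apply the helper to every factor column in one step.
fixed <- toy %>%
  mutate(across(where(is.factor), relabel_na))

fixed %>% select(incomeRating) %>% table() %>% prop.table()
```

The same `mutate(across(where(is.factor), relabel_na))` call should work on `donors` if the assumption holds. Real `NA` values, if there are any, are left untouched by this helper.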

1 Answer

Try fct_explicit_na from the forcats package (code not tested):

library(forcats)
library(dplyr)

donors <- donors %>% 
  mutate(incomeRating = fct_explicit_na(incomeRating, "UNK"))
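
One caveat, sketched below on made-up vectors: `fct_explicit_na()` only converts real missing values, so it would not touch a level that is literally the string "NA" (which is what the comments under the question point to). In that case `fct_recode()` can rename the level instead. Also, forcats 1.0.0 deprecated `fct_explicit_na()` in favour of `fct_na_value_to_level()`, so the sketch assumes forcats 1.0.0 or later. The names `real_na` and `literal_na` are hypothetical examples, not columns from `donors`.

```r
library(forcats)

# Hypothetical example vectors covering the two situations.
real_na    <- factor(c(NA, "1", "2"))    # a genuine missing value
literal_na <- factor(c("NA", "1", "2"))  # the string "NA" stored as a level

# Real NA: promote the missing value to an explicit "UNK" level.
a <- fct_na_value_to_level(real_na, level = "UNK")

# Literal "NA" level: rename it (syntax is new_name = "old_name").
b <- fct_recode(literal_na, UNK = "NA")

levels(a)
levels(b)
```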
TarJae