I have a dataset of 12901 observations with 34 variables; the values are either categorical or NA. I will use the dataset to create a market segmentation study by clustering consumer demographics.
For the categorical variables, I want to convert them to numeric binary data. For example, the variable HouseholdIncome has six categories: 50k-75k, 75k-100k, 35k-50k, 100k-125k, 150k-175k, and Other. I want HouseholdIncome to be broken up into six binary variables, so that each category is encoded as one of (0,0,0,0,0,1), (0,0,0,0,1,0), (0,0,0,1,0,0), (0,0,1,0,0,0), (0,1,0,0,0,0), and (1,0,0,0,0,0).
Question: how can I change the categorical values to binary variables, yet keep the NAs?
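To make the target concrete, here is a minimal sketch of the output shape I am after, built by hand on a made-up vector. The name income and its values are hypothetical stand-ins for HouseholdIncome, not my real data, and this only illustrates the desired result for one column; I am not assuming it is the right way to handle all 34 columns.

```r
# Toy stand-in for one categorical column (hypothetical values, not the real data)
income <- factor(c("50k-75k", NA, "75k-100k"))

# One 0/1 indicator column per level. Comparing NA with a level yields NA,
# so an observation that was NA stays NA in every indicator column.
target <- sapply(levels(income), function(lv) as.integer(income == lv))

print(target)
```

The second row of target is NA in both columns, which is the behaviour I want to keep for the real data.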
My machine:
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
My data:
#Head of first six rows of the first six columns
> head(Store4df)
Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus
1 55-64 Female 50k-75k Single No Own
2 <NA> Female <NA> <NA> <NA> <NA>
3 <NA> Male <NA> <NA> <NA> <NA>
4 <NA> Male <NA> <NA> <NA> <NA>
5 65+ Male 75k-100k Single No Own
6 <NA> Female <NA> <NA> <NA> <NA>
I have read other posts about the command, but none have solutions for NA values. I followed a link about Creating new dummy variable columns from categorical variables. I used the second suggestion and got the data in binary form, but the result did not include the NA values.
> #Use model.matrix function to create the binary columns
> binary1 <- model.matrix(~ factor(Store4df$HomeMarketValue) - 1)
> #Find which rows have NA values
> which(rowSums(is.na(binary1))==ncol(binary1))
# named integer(0)
> #Get head of model.matrix of two columns with five rows
> head(binary1, n=5)
factor(Store4df$HomeMarketValue)100k-150k factor(Store4df$HomeMarketValue)150k-200k
1 0 0
2 0 0
3 1 0
4 0 0
5 0 0
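The same behaviour can be reproduced on a small made-up data frame (toy and its values are hypothetical, not my real data). Assuming the default na.action of "na.omit", which is set explicitly below, model.matrix appears to drop the NA rows before building the matrix, which would explain why no NA rows are found in the result:

```r
# Use R's default NA handling explicitly so the example is reproducible
old <- options(na.action = "na.omit")

# Hypothetical toy data: 5 observations, 2 of them NA
toy <- data.frame(HomeOwnerStatus = factor(c("Own", NA, "Rent", NA, "Own")))

# Same kind of call as above, on the toy column
mm <- model.matrix(~ HomeOwnerStatus - 1, data = toy)

nrow(toy)      # number of observations going in
nrow(mm)       # number of rows coming out
anyNA(mm)      # are there any NA values left in the matrix?

# Restore the previous setting
options(old)
```

So the NA rows are not converted to 0; they are missing from the matrix entirely, and the row count no longer matches the original data.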
EDIT: I forgot to post that I have two types of categorical variables: one with categories and NA values, and another with only TRUE and NA values. I got an error when putting the variables with TRUE and NA values into model.matrix.
> model.matrix(~ -1 + . , data = Store4df)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Here's what the variable looks like:
> che <- Store4df$Pets
> summary(che)
Mode TRUE NA's
logical 3535 9628
After putting one of these variables into model.matrix:
> data <- model.matrix(~ Pets, data = Store4df)
> summary(data)
(Intercept) PetsTRUE
Min. :1 Min. :1
1st Qu.:1 1st Qu.:1
Median :1 Median :1
Mean :1 Mean :1
3rd Qu.:1 3rd Qu.:1
Max. :1 Max. :1
How can I get the TRUE value replaced in columns 10 and 12:34?