1

I have a dataset of 12901 categorical and NA observations with 34 variables. I will use the dataset for create a market segmentation study by clustering consumer demographics.

For the categorical variables, I want to convert to numeric binary data. For example, variable HouseholdIncome has six categories: 50K-75k, 75k-100k, 35k-50k, 100k-125k, 150k-175k, and Other. I want HouseholdIncome to be broken up into six variables (0,0,0,0,0,1), (0,0,0,0,1,0), (0,0,0,1,0,0), (0,0,1,0,0,0), (0,1,0,0,0,0), and (1,0,0,0,0,0).

Question: how can I change the categorical values to binary variables, yet keep the NAs?

My machine:

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

My data:

#Head of first six rows of the first six columns
> head(Store4df)
     Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus
1  55-64 Female         50k-75k        Single                 No             Own
2   <NA> Female            <NA>          <NA>               <NA>            <NA>
3   <NA>   Male            <NA>          <NA>               <NA>            <NA>
4   <NA>   Male            <NA>          <NA>               <NA>            <NA>
5    65+   Male        75k-100k        Single                 No             Own
6   <NA> Female            <NA>          <NA>               <NA>            <NA>

I have read other posts about the command, but none have solutions for NA values. I followed a link about Creating new dummy variable columns from categorical variables. I used the second suggestion and the data in binary form, but the code did not include the NA values.

> #Use model.matrix function to 
> binary1 <- model.matrix(~ factor(Store4df$HomeMarketValue) - 1)
> #Find which rows have NA values
> which(rowSums(is.na(binary1))==ncol(binary1))
# named integer(0)
> #Get head of model.matrix of two columns with five rows
> head(binary1, n=5)
   factor(Store4df$HomeMarketValue)100k-150k factor(Store4df$HomeMarketValue)150k-200k
1                                          0                                         0
2                                          0                                         0
3                                          1                                         0
4                                          0                                         0
5                                          0                                         0

EDIT: I forgot to post that I have two types of categorical variables. One with categories and NA values, with another having TRUE and NA values. I got an error about putting the variables with TRUE and NA values into a model.matrix.

> model.matrix(~ -1 + . , data = Store4df)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

Here's what the variable looks like:

> che <- Store4df$Pets
> summary(che)
   Mode    TRUE    NA's 
logical    3535    9628 

After putting one factor variable into model.matrix:

> data <- model.matrix(~  Pets, data = Store4df)
> summary(data)

  (Intercept)    PetsTRUE
 Min.   :1    Min.   :1  
 1st Qu.:1    1st Qu.:1  
 Median :1    Median :1  
 Mean   :1    Mean   :1  
 3rd Qu.:1    3rd Qu.:1  
 Max.   :1    Max.   :1  

How can I get the TRUE value replaced in columns 10 and 12:34?

Scott Davis
  • 983
  • 6
  • 22
  • 43
  • Why do you need this? What do you plan to do with the data afterward? R generally prefers categorical data to remain in factor form rather than dummy variable form and then just apply contrasts as needed. – MrFlick Sep 11 '14 at 18:37
  • @MrFlick I will use the data to perform clustering analysis with the CLARA function from the CLUSTER package. I tried using PAM, but couldn't create the dissimilarity measure beforehand b/c of the partial fill rate of the dataset. I am switching to a numeric algorithm that takes a sample instead of calculating the distance from every observation. – Scott Davis Sep 11 '14 at 18:56

3 Answers3

3

I don't think model.matrix can take an argument to detail how to treat missing data However, you can change the default options to na.pass thus keeping the missing values in the model.matrix call.

# create data with missing values
set.seed(1)
dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20), 
                                                  stringsAsFactors=FALSE)
dat[c(5,10,15),1] <- NA

# set default options for handling missing data
options(na.action='na.pass')

# note that rows with missing data are retained
m <- model.matrix(~ -1 + x + y, data=dat)

# return option to default
options(na.action='na.omit')

From here

Community
  • 1
  • 1
user20650
  • 24,654
  • 5
  • 56
  • 91
  • I ran the script for retaining rows with missing data, but got an error. I changed the dataset in your example to Store4df. m <- model.matrix(~ -1 + x + y, data = Store4df) Error in eval(expr, envir, enclos) : object 'x' not found – Scott Davis Sep 13 '14 at 20:47
  • 1
    That will be because `x` is not defined in `Storedf`. For your actual data try `Store4df.m <- model.matrix(~ -1 + HomeMarketValue, data = Store4df)` – user20650 Sep 13 '14 at 20:57
  • Thanks that worked! To include other categorical variables in addition to HomeMarketValue, can I define multiple x variables? c(HouseholdIncome, HomeMarketValue, ect.). – Scott Davis Sep 13 '14 at 22:52
  • 1
    Good stuff. Yes you can add multiple variables, for example, `model.matrix(~ -1 + HouseholdIncome + HomeMarketValue, data = Store4df)`. You could actually just do `model.matrix(~ -1 + . , data = Store4df)` if you want to include all variables. Also not sure how you want to deal with the reference level but if you want to include all levels of each variable in the dummy matrix then [this question will help](http://stackoverflow.com/questions/4560459/all-levels-of-a-factor-in-a-model-matrix-in-r) – user20650 Sep 13 '14 at 23:04
  • I forgot to post I have variables with less than two factors. How can I convert a variable with only TRUE and NA values to (1,NA)? – Scott Davis Sep 15 '14 at 18:19
  • 1
    You can just add zero as `TRUE` is treated as `1`. Try `c(TRUE, FALSE, NA) + 0` or `c(TRUE, FALSE, NA) * 1` – user20650 Sep 15 '14 at 23:19
  • Can I put that in the command for including all variables or as a separate object? My goal is to create the dataset with the two types of categorical variables at the same time. I don't want to edit two different datasets then combine them afterwards. – Scott Davis Sep 16 '14 at 01:41
  • 1
    Scott, sorry, I don't quite follow the comment above. But regarding the edit in your question: I don't think it is the `pets` variable causing the problem. ie try `model.matrix(~ Pets, data=Store4df)` - should run ok. – user20650 Sep 16 '14 at 02:34
  • 1
    The error looks as if it is coming from one of your variables being constant / having only one level. Try this example: `dat <- data.frame(x=sample(letters[1:3],10,TRUE),z=factor("oneLevel"), stringsAsFactors=FALSE)` and then `model.matrix(~ -1 + x + z, data=dat)` - this reproduces the error in your edit, as `z` has only one level. – user20650 Sep 16 '14 at 02:37
  • I put in `PETS` as the z factor in `model.matrix`, but the number of observations was decreased and every row shows 1. Now there are 1859 observations in the dataset instead of 12901. I put in the summary in the edit. – Scott Davis Sep 16 '14 at 20:12
  • 1
    @ScottDavis; that is probably because you did not use the `na.pass` option, so it has removed the rows with `NA`'s. But the fact it runs shows that it is not the variable causing the problem. You need to examine your dataset and look for variables with only one level. – user20650 Sep 16 '14 at 20:25
  • You are right, I forgot to run the first line. Thanks! – Scott Davis Sep 16 '14 at 20:59
1

For a work-around, what I would do is replace the <NA> with "Not Available" (or something). Then, <NA> will also be treated as a factor level as well.

copy <-Store4df
levels(copy$HomeMarketValue) <- c(levels(copy$HomeMarketValue),"Not Available")
copy$HomeMarketValue[is.na(copy$HomeMarketValue)]<-"Not Available"
binary1 <- model.matrix(~ factor(copy$HomeMarketValue) - 1)

Note that I have not tested the above because you do not provide data I can use to reproduce your example. But, now you should get a factor level dummy variable that is 1 for "Not Available".

By way of example:

A<-data.frame(ID=1:100,x=sample(c(1:5,NA),100,replace=TRUE))
A$x[is.na(A$x)]<-"NotAvailable"
MM<-model.matrix(~factor(A$x)-1)
for(i in 1:5) {
  MM[,i][MM[,6]==1]<-NA
}
MM<-MM[,-6]
head(MM)
##  factor(A$x)1 factor(A$x)2 factor(A$x)3 factor(A$x)4 factor(A$x)5
##1            0            0            1            0            0
##2            1            0            0            0            0
##3           NA           NA           NA           NA           NA
##4            1            0            0            0            0
##5            0            0            0            1            0
##6            0            0            0            1            0
mgriebe
  • 908
  • 5
  • 8
  • @mrgriebe I got an error on the second line of the solution saying Warning message: In `[<-.factor`(`*tmp*`, is.na(copy$HomeMarketValue), value = c(9L, : invalid factor level, NA generated – Scott Davis Sep 11 '14 at 19:07
  • 1
    use 'levels' to add the factor level "NotAvailable". – mgriebe Sep 11 '14 at 19:10
  • @mrgriebe Looking back on the solution, I don't think making NA into a variable is a good idea. I think all the categorical observations should have their own column, but leave in the NA values. How can we do that? I will use the dataset for clustering and CLARA will calculate the distance from one missing objection to another. After creating the sample, CLARA will assign a NA value to the medoid. I edited my post to show the use of the dataset. – Scott Davis Sep 11 '14 at 21:59
  • Like I said, its a work around. I'd argue that it still answers your question. – mgriebe Sep 11 '14 at 22:12
  • You are right, we still have the NA values in the dataset. However, I am calculating distances and as long as every pair of observations have at least one case missing, its okay. I have not seen any data miners converting NAs to a binary variable. Making NA a factor level may be useful in some cases, but we need NAs within each factor. – Scott Davis Sep 11 '14 at 22:26
1

This is easily done using the caret package.
Code below is a quick way to get it done for any number of variables at a time.

require(caret)
# Make sure variables you are using are factors

VARS.TO.MAKE.DUMMY <-  #list of variables to convert to dummy
    c("HouseholdIncome", "Age")

dat.temp <- # Temporary data.frame to make dummies
    Store4df[,VARS.TO.MAKE.DUMMY]

dummy.vars <- # create dummies  
    predict(  
        dummyVars(  
            ~ .,   
            data = dat.temp
            ),
        newdata = dat.temp,
        na.action = na.pass
        )

Store4df <- # Append results to original dataframe
    cbind(Store4df, as.data.frame(dummy.vars))

rm(dummy.vars, dat.temp) # Garbage collection
TaylerJones
  • 242
  • 2
  • 15