Building a classification tree with categorical variables using rpart

Question

I have a data set with 14 features; a few of them are listed below. Sex and marital status are categorical variables.

height,sex,maritalStatus,age,edu,homeType

SEX
         1. Male
         2. Female

MARITAL STATUS
         1. Married
         2. Living together, not married
         3. Divorced or separated
         4. Widowed
         5. Single, never married

I am using the rpart library in R to build a classification tree with the following call:

rfit = rpart(homeType ~., data = trainingData, method = "class", cp = 0.0001)

This gives me a decision tree that does not consider sex and marital status as factors.

I am thinking of using as.factor for this:

sex = as.factor(trainingData$sex)
ms = as.factor(trainingData$maritalStatus)

But I am not sure how to pass this information to rpart. Since the data argument of rpart() takes the trainingData data frame, it will always use the values in that data frame. I am a little new to R and would appreciate help with this.

score 11 · Accepted Answer · answered Nov 14 '14 at 14:37

You could make the changes to the trainingData data frame directly, then run rpart().

trainingData$sex = as.factor(trainingData$sex)
trainingData$maritalStatus = as.factor(trainingData$maritalStatus)
rfit = rpart(homeType ~., data = trainingData, method = "class", cp = 0.0001)
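To check that the conversion took effect, it can help to inspect the column types before fitting. The following is a minimal sketch on made-up data: the toy data frame, its values, the shortened level labels, and the assumption that homeType has three codes are all hypothetical stand-ins for the real trainingData.

```r
library(rpart)

# Hypothetical toy data standing in for trainingData; all values are made up.
set.seed(1)
n <- 200
trainingData <- data.frame(
  height        = round(rnorm(n, mean = 170, sd = 10)),
  sex           = sample(1:2, n, replace = TRUE),
  maritalStatus = sample(1:5, n, replace = TRUE),
  age           = sample(18:80, n, replace = TRUE),
  homeType      = sample(1:3, n, replace = TRUE)  # assumed to have 3 codes
)

# Recode the integer codes as factors, attaching labels from the codebook.
trainingData$sex <- factor(trainingData$sex, levels = 1:2,
                           labels = c("Male", "Female"))
trainingData$maritalStatus <- factor(
  trainingData$maritalStatus, levels = 1:5,
  labels = c("Married", "Living together", "Divorced/separated",
             "Widowed", "Single")
)

# Confirm that sex and maritalStatus now show up as Factor columns.
str(trainingData)

rfit <- rpart(homeType ~ ., data = trainingData, method = "class", cp = 0.0001)
print(rfit)
```

Because the toy response is random, the tree itself is meaningless here. The point is only that any split on sex or maritalStatus is now reported as a set of levels rather than as a numeric threshold.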

Jean V. Adams


I am trying to apply that answer to a similar example where my categorical variable is the day of the week. Just to be sure, I left only that variable in the training set, but when I try to train the classifier the model appears to have only a root node, which means that it doesn't take the variable into account. Do you have any idea what might be the problem? – LetsPlayYahtzee Dec 26 '15 at 20:23

score -4 · Answer 2 · answered Jun 14 '17 at 10:30

In practice you can transform any categorical value into an ordinal value, for instance by mapping 'Marital Status' to the codes 1, 2, 3, and so on. In general, though, you shouldn't make the transformation unless intermediate values have a conceptual meaning. For example, if you cannot define what a marital status of 1.2 is, you shouldn't make the transformation. Instead, you can sometimes use a representative value, depending on the objective of your research. For instance, if you are trying to predict the type of home, a 'minimum degree of comfort' for each marital status is an ordinal value that can still be interpreted when it is, say, 1.2.
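In R, the distinction between a nominal and an ordinal variable can be kept explicit with an ordered factor, without turning the codes into a continuous number. The sketch below uses hypothetical education codes and level names; the question does not say what the edu codes mean, so they are assumptions for illustration only.

```r
# Hypothetical education codes; the level names are made up for illustration.
edu_codes <- c(1, 3, 2, 3, 1)

# Unordered factor: the levels are plain categories with no ranking.
edu_nominal <- factor(edu_codes, levels = 1:3,
                      labels = c("Primary", "Secondary", "Tertiary"))

# Ordered factor: the levels keep their rank, but no level "1.2" is implied.
edu_ordinal <- factor(edu_codes, levels = 1:3,
                      labels = c("Primary", "Secondary", "Tertiary"),
                      ordered = TRUE)

is.ordered(edu_nominal)          # FALSE
is.ordered(edu_ordinal)          # TRUE
edu_ordinal[1] < edu_ordinal[2]  # TRUE: Primary ranks below Tertiary
```

To my understanding, rpart splits an ordered factor by rank, much like a numeric variable, while an unordered factor is split into arbitrary subsets of levels. For a variable such as marital status, which has no natural ranking, the unordered form is likely the safer choice.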
