rpart finding the observations in each node

Question

I have created a decision tree using rpart, and I am wondering how to find exactly which cases of the training data fall into each terminal node.

I followed the answer in this link: How to count the observations falling in each node of a tree, but for some reason $where only produces a vector of terminal nodes, without the row numbers indicating which case corresponds to which terminal node. However, if I do exactly the same thing with a tree made using the tree package, I get a list of row numbers (identifying each case) with the corresponding terminal nodes. The only difference I noticed is that for the rpart package $where is an "int" vector, while for the tree package $where is a "Named int" vector. How can I produce the same "Named int" vector for a tree made with rpart?

I have also tried the answer suggested in: Find the data elements in a data frame that pass the rule for a node in a tree model?, but it does not work for me because rpart deleted 16 observations while creating the model, so the number of observations in the resulting model does not match the original data frame used to create it.

Sorry if the answer seems obvious, newbie R user here!

Here is the code I used to create the tree. It is a tree used to predict a diagnosis of autism based on behavioural profiles:

library(caret)  # createDataPartition()
library(rpart)  # rpart(), prune()

set.seed(565808016)  # R is case-sensitive: set.seed, not Set.seed
inTrain21 <- createDataPartition(clinicaldiagnosis, p=0.75, list=FALSE)
training_data21 <- Decisiontree4[ inTrain21,]
testing_data21 <- Decisiontree4[-inTrain21,]
test_clinicaldiagnosis21 <- clinicaldiagnosis[-inTrain21]
lossmatrix <- matrix(c(0,1,1,1,0,1,2,1,0), ncol=3, nrow=3)

set.seed(591251974)
tree_model22 <- rpart(clinicaldiagnosis ~ Visualtracking + etc etc, training_data21, na.action=na.rpart, method="class", control=rpart.control(cp=0.00001), parms=list(loss=lossmatrix))
plot(tree_model22, uniform=TRUE, margin=0.05)
text(tree_model22, use.n=TRUE, pretty=0)
plotcp(tree_model22)
printcp(tree_model22)

pruned_model22=prune(tree_model22, cp=0.0146341)
plot(pruned_model22, uniform=TRUE, margin=0.1)
text(pruned_model22, use.n=TRUE, cex=0.85, splits=TRUE, pretty=0)

tree_pred22=predict(pruned_model22, testing_data21, type="class")
table(tree_pred22, test_clinicaldiagnosis21)
trainingnodes22 <- rownames(pruned_model22$frame)[pruned_model22$where]  # this only gives a vector of terminal nodes, without the corresponding row names

Please post some of your code, or more technical details such as the condition that each training case must satisfy in order to fall into each terminal node. — Juan David, Aug 06 '14 at 22:30
How did you get rpart to drop observations? I tried adding in some NA values, but they still get classified in my test. Is there a way you can demonstrate this using the built-in kyphosis dataset? Without a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) it is difficult to help. We can't run the code you posted because we don't have the data. — MrFlick, Aug 06 '14 at 22:58
Solved the problem, I think! Using the second method, I just found the missing observations that rpart deleted and took them out of the dataset. I think rpart deleted those observations because they had NA for all the predictors, but I wanted to keep them in because they had non-NA values for other predictors that weren't used to build the tree. Thank you very much for coming up with the function! Although I still don't get why $where doesn't bring up a Named int vector... — Linda, Aug 07 '14 at 05:27
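
The fix described in the comment above (leave out the rows rpart dropped, then line the rest up with $where) might look roughly like the sketch below. It is only an illustration under stated assumptions: the built-in kyphosis data stands in for the real training set, three rows are blanked out artificially so that na.rpart has something to drop, and the names dat, fit, dropped, kept, node_of_row and rows_by_node are made up for this example.

```r
library(rpart)

# Assumption: kyphosis stands in for the unavailable training data.
dat <- kyphosis
# Blank out every predictor in three rows so that na.rpart drops them.
dat[c(5, 20, 40), c("Age", "Number", "Start")] <- NA

fit <- rpart(Kyphosis ~ Age + Number + Start, data = dat,
             method = "class", na.action = na.rpart)

# Row positions removed by the na.action (empty when nothing was dropped).
dropped <- as.integer(fit$na.action)
kept <- setdiff(seq_len(nrow(dat)), dropped)

# fit$where indexes rows of fit$frame, whose row names are the node numbers.
# Naming the result by the surviving row names gives a "Named int" vector.
node_of_row <- setNames(as.integer(rownames(fit$frame))[fit$where],
                        rownames(dat)[kept])

# Row names of the training cases, grouped by terminal node.
rows_by_node <- split(names(node_of_row), node_of_row)
```

Here node_of_row["12"] would give the terminal node of the row named "12", and rows_by_node lists the cases in each leaf. The same idea should carry over to a pruned tree, since prune() returns an rpart object with its own $where and $frame.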
