
I am new to Data Science and I am working on a Machine Learning analysis using Random Forest algorithm to perform a classification. My target variable in my data set is called Attrition (Yes/No).

I am a bit confused as to how to generate these 2 plots in Random Forest:

(1) Feature Importance Plot

(2) Decision Tree Plot

I understand that a Random Forest is an ensemble of several Decision Tree models built from the data set.

Assuming my Training data set is called TrainDf and my Testing data set is called TestDf, how can I create these 2 plots in R?

UPDATE: From these 2 posts, it seems that this cannot be done, or am I missing something here?

  • Why is Random Forest with a single tree much better than a Decision Tree classifier?

  • How would you interpret an ensemble tree model?

Sandipan Dey
user3115933
  • for plotting the (pseudo) tree structure, check out the answers to this question: https://stats.stackexchange.com/questions/41443/how-to-actually-plot-a-sample-tree-from-randomforestgettree – Shinobi_Atobe Aug 24 '18 at 08:06
  • 2
    The Decision Tree Plot doesn't make sense in a RF, because (as the name suggest) there are multiple Trees, each is a little (or a lot) different from the other. So you can't make one single plot, unless somehow you average all those trees (not very useful). The importance plot can be done, see the first answer here. – RLave Aug 24 '18 at 08:10
  • OP possibly wants to be able to print any decision tree from the forest, which can be done with the `getTree` function – Sandipan Dey Aug 24 '18 at 10:41
  • @RLave I am confused as to which answer your link refers to. – user3115933 Aug 24 '18 at 12:20

2 Answers


To plot the variable importance, you can use the code below.

library(randomForest)
mtcars.rf <- randomForest(am ~ ., data=mtcars, ntree=1000, keep.forest=FALSE,
                          importance=TRUE)
varImpPlot(mtcars.rf)
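Since the question's target Attrition is a Yes/No factor, the same call works for a classification forest; here is a minimal sketch under the assumption that the data frame has a factor target (iris stands in for TrainDf, which isn't available here, and the Attrition column below is invented purely for illustration):

```r
library(randomForest)

# Toy stand-in for the question's setup: TrainDf and its Attrition (Yes/No)
# column aren't available, so iris is relabelled into a made-up two-class factor.
df <- iris
df$Attrition <- factor(ifelse(df$Species == "setosa", "Yes", "No"))
df$Species <- NULL

set.seed(42)
rf <- randomForest(Attrition ~ ., data = df, ntree = 500, importance = TRUE)

# With importance=TRUE, varImpPlot shows both MeanDecreaseAccuracy and
# MeanDecreaseGini panels for a classification forest.
varImpPlot(rf)
```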
RSK

Feature importance plot with ggplot2:

library(randomForest)
library(ggplot2)
mtcars.rf <- randomForest(vs ~ ., data=mtcars)
imp <- cbind.data.frame(Feature=rownames(mtcars.rf$importance),mtcars.rf$importance)
g <- ggplot(imp, aes(x=reorder(Feature, -IncNodePurity), y=IncNodePurity))
g + geom_bar(stat = 'identity') + xlab('Feature')

[feature importance bar plot]

A Decision Tree plot with igraph (a single tree from the random forest):

tree <- randomForest::getTree(mtcars.rf, k=1, labelVar=TRUE) # get the 1st decision tree with k=1
tree$`split var` <- as.character(tree$`split var`)
tree$`split point` <- as.character(tree$`split point`)
tree[is.na(tree$`split var`),]$`split var` <- ''
tree[tree$`split point` == '0',]$`split point` <- ''

library(igraph)
gdf <- data.frame(from = rep(rownames(tree), 2),
                  to = c(tree$`left daughter`, tree$`right daughter`))
g <- graph_from_data_frame(gdf, directed=TRUE)
V(g)$label <- paste(tree$`split var`, '\r\n(', tree$`split point`, ',', round(tree$prediction,2), ')')
g <- delete_vertices(g, '0')
print(g, e=TRUE, v=TRUE)
plot(g, layout = layout.reingold.tilford(g, root=1), vertex.size=5, vertex.color='cyan')

As can be seen from the following plot, the label for each node in the decision tree shows the variable chosen for the split at that node, followed by (the split value, the proportion of class with label 1) at that node.

[decision tree plot of the 1st tree]

Likewise, the 100th tree can be obtained with k=100 in the randomForest::getTree() function; it looks like the following:

[decision tree plot of the 100th tree]
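The steps above can be bundled into a small reusable helper (plot_rf_tree is a made-up name, not part of randomForest or igraph), so that any tree k of the forest can be drawn the same way:

```r
library(randomForest)
library(igraph)

# Sketch of a helper wrapping the steps above: pull tree k out of the
# forest with getTree and draw it as a directed graph with igraph.
plot_rf_tree <- function(rf, k = 1) {
  tree <- randomForest::getTree(rf, k = k, labelVar = TRUE)
  tree$`split var` <- as.character(tree$`split var`)
  tree$`split point` <- as.character(tree$`split point`)
  tree$`split var`[is.na(tree$`split var`)] <- ''
  tree$`split point`[tree$`split point` == '0'] <- ''

  # Edges run from each node to its two daughters; leaf rows point to node 0.
  gdf <- data.frame(from = rep(rownames(tree), 2),
                    to = c(tree$`left daughter`, tree$`right daughter`))
  g <- graph_from_data_frame(gdf, directed = TRUE)
  g <- delete_vertices(g, '0')  # drop the dummy vertex the leaves point to
  V(g)$label <- paste(tree$`split var`, '\n(', tree$`split point`, ',',
                      round(tree$prediction, 2), ')')
  plot(g, layout = layout_as_tree(g, root = 1),
       vertex.size = 5, vertex.color = 'cyan')
  invisible(g)
}

mtcars.rf <- randomForest(vs ~ ., data = mtcars)  # default ntree = 500
plot_rf_tree(mtcars.rf, k = 100)                  # any k from 1 to ntree
```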

Sandipan Dey