I want to retrieve the path each instance takes through a decision tree or random forest. For example, I need output like this:

# 1  1 3 4 8 NA NA
# 2  1 2 5 7 11 NA
# 3  1 3 4 9 10 13
# 4  1 3 4 8 NA NA
# etc

This means that instance #1 follows the path through nodes 1, 3 and 4 and ends in terminal node 8, and so forth. Naturally, some instances have shorter paths than others.

I used decision_path, but it returns a sparse matrix that I cannot read, let alone extract such a path from. Here is sample code for the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
xtrain = iris.data
ytrain = iris.target

dtree = DecisionTreeClassifier()
fitted_tree = dtree.fit(X=xtrain, y=ytrain)
predictiontree = dtree.predict(xtrain)
fitted_tree.decision_path(xtrain)

The output is this:

<150x17 sparse matrix of type '<class 'numpy.int64'>'
with 560 stored elements in Compressed Sparse Row format>

Please help me build a matrix like the one shown at the top. I have no idea how to handle a sparse matrix.

Hadij
  • use `.todense()` ? [generating-a-dense-matrix-from-a-sparse-matrix-in-numpy-python](https://stackoverflow.com/questions/16505670/generating-a-dense-matrix-from-a-sparse-matrix-in-numpy-python) – Patrick Artner Jan 17 '18 at 16:32
  • Possible duplicate of [Generating a dense matrix from a sparse matrix in numpy python](https://stackoverflow.com/questions/16505670/generating-a-dense-matrix-from-a-sparse-matrix-in-numpy-python) – Patrick Artner Jan 17 '18 at 16:33
  • @PatrickArtner can you interpret the output of .todense()? It looks like [1 0 1 0 1 0 1 0 0 0 0 0 0 0 ] for one instance. Do you know what it means? – Hadij Jan 17 '18 at 18:36
  • It's the representation of whatever this DecisionTreeClassifier did to your data. Do you have a `1,2,3,4,5,6,7,8,9,10,11,12,13,14` path that walks over `1,3,5,7`? A sparse matrix is used when you have far more "empties" or "defaults" than real data, so it's "cheaper" to store only each data point with its index than the whole row. For example, with a 1000x1000 array that holds 100 ints and is otherwise 0, you store 100 (x, y, value) tuples instead of 1,000,000 values, most of which are 0. – Patrick Artner Jan 17 '18 at 18:43
  • @PatrickArtner thanks a lot. You marked the question as a duplicate. In a sense it is, but this is something anyone trying to find the path in a decision tree may run into. I hope you let the question remain open and add this comment as an answer. It was really helpful to me. The reason I need to make it dense is that it will be the input for another method that does not accept a sparse matrix. – Hadij Jan 17 '18 at 18:55
  • Write up an answer yourself please; I was unable to install scipy locally and can't test code. You can then even mark your own answer as "the answer" in 2 days or so. You might also want to read up here https://stackoverflow.com/tags/sparse-matrix/info and in the linked Wikipedia article – Patrick Artner Jan 17 '18 at 18:56
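
The sparse-versus-dense trade-off described in the comments can be seen on a toy example. This is only an illustrative sketch: the small array below is made up, and it assumes SciPy's `csr_matrix`, the same Compressed Sparse Row format that `decision_path` reports.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A made-up 0/1 matrix that is mostly zeros, like a decision-path indicator.
dense = np.array([
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
])
sparse = csr_matrix(dense)

# CSR keeps only the non-zero entries:
#   indices -> column of each stored entry, row after row
#   indptr  -> where each row starts and ends inside `indices`
print(sparse.indices)
print(sparse.indptr)

# Columns holding a non-zero value in row 1.
row = 1
cols_row1 = sparse.indices[sparse.indptr[row]:sparse.indptr[row + 1]]
print(cols_row1)

# Converting back recovers the original dense array.
roundtrip = sparse.toarray()
```

Reading `indices` between two consecutive `indptr` offsets is usually enough to list the visited nodes of one sample without ever building the dense matrix.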

2 Answers

Thanks to the comment of @Patrick Artner, this is the answer:

dense_matrix = fitted_tree.decision_path(xtrain).todense()

It gives output like this:

#matrix([[1, 1, 0, ..., 0, 0, 0],
#        [1, 1, 0, ..., 0, 0, 0],
#        [1, 1, 0, ..., 0, 0, 0],
#        ..., 
#        [1, 0, 1, ..., 0, 0, 1],
#        [1, 0, 1, ..., 0, 0, 1],
#        [1, 0, 1, ..., 0, 0, 1]], dtype=int64)

Each row corresponds to one instance, and each column to one node of the tree. For example, the first row is [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], which means that the first instance passes through the first and second nodes only (nodes 1 and 2 when counting from 1; scikit-learn's own node ids start at 0, so these are nodes 0 and 1).
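
A minimal sketch of how the padded, per-instance table from the question could be built from the indicator matrix. `padded_paths` is a hypothetical helper, not part of scikit-learn, and `random_state=0` is added only to make the run repeatable. It assumes that sorting the visited node ids gives root-to-leaf order, which holds because scikit-learn assigns a child node a larger id than its parent.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

indicator = tree.decision_path(X)  # CSR matrix, shape (n_samples, n_nodes)


def padded_paths(indicator, fill=np.nan, one_based=True):
    """Hypothetical helper: one row per sample listing the visited node ids."""
    indicator = indicator.tocsr()
    lengths = np.diff(indicator.indptr)          # path length of each sample
    out = np.full((indicator.shape[0], lengths.max()), fill, dtype=float)
    for i in range(indicator.shape[0]):
        nodes = indicator.indices[indicator.indptr[i]:indicator.indptr[i + 1]]
        # Shift to 1-based ids to match the numbering used in the question.
        out[i, :len(nodes)] = np.sort(nodes) + (1 if one_based else 0)
    return out


paths = padded_paths(indicator)
print(paths[:5])
```

Shorter paths are padded with NaN, which plays the role of the NA entries in the requested output; the array is float because NaN has no integer representation.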

Hadij

Or you can also do the following if you need more control over each sample's decision path:

decision_paths = fitted_tree.decision_path(xtrain)
decision_path_list = list(decision_paths.toarray())
for path in decision_path_list:
    # Analyse each sample's path here; `path` is a 0/1 vector over node ids
    visited_nodes = path.nonzero()[0]
DesiKeki
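
The question also mentions random forests. As a hedged sketch: `RandomForestClassifier.decision_path` returns a pair, the indicator matrix for all trees placed side by side and an array of column offsets marking where each tree's nodes begin. The variable names, `n_estimators=5` and `random_state=0` below are illustrative choices, not something taken from the posts above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# indicator: sparse matrix, shape (n_samples, total nodes over all trees)
# n_nodes_ptr: columns n_nodes_ptr[t]:n_nodes_ptr[t + 1] belong to tree t
indicator, n_nodes_ptr = forest.decision_path(X)
indicator = indicator.tocsr()

per_tree_paths = []
for t in range(len(forest.estimators_)):
    block = indicator[:, n_nodes_ptr[t]:n_nodes_ptr[t + 1]].tocsr()
    # Node ids visited by each sample, local to tree t and 0-based.
    paths = [
        np.sort(block.indices[block.indptr[i]:block.indptr[i + 1]])
        for i in range(block.shape[0])
    ]
    per_tree_paths.append(paths)

# Path of the first sample through the first tree.
print(per_tree_paths[0][0])
```

Each entry of `per_tree_paths` can then be padded to a fixed width in the same way as for a single tree.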