
I want to find the distance of samples to the decision boundary of a trained decision tree classifier in scikit-learn. The features are all numeric and the feature space could be of any size.

I have this visualization so far for an example 2D case based on here:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

# Generate some example data
X, y = make_moons(noise=0.3, random_state=0)

# Train the classifier
clf = DecisionTreeClassifier(max_depth=2)

clf.fit(X, y)

# Plot
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k')
plt.xlabel('a'); plt.ylabel('b');

[Image: decision regions of the trained tree plotted over the two-moons samples]

I understand that for some other classifiers like SVM, this distance can be mathematically calculated [1, 2, 3]. The rules learned by a trained decision tree define the boundaries and may also be helpful for algorithmically calculating the distances [4, 5, 6]:

# Plot the trained tree
from sklearn import tree
import graphviz 
dot_data = tree.export_graphviz(clf, feature_names=['a', 'b'],  class_names=['1', '2'], filled=True)  
graph = graphviz.Source(dot_data)  

[Image: the trained decision tree exported with graphviz, showing the splits on features a and b]
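
For reference, the same split rules are also exposed programmatically on the fitted estimator through its `tree_` attribute (these are standard scikit-learn attributes). A minimal sketch of reading them from the `clf` trained above, in case it helps with an algorithmic solution:

# Print the split rules stored in the fitted tree (sketch; uses clf from above)
t = clf.tree_
for node in range(t.node_count):
    if t.children_left[node] == t.children_right[node]:   # leaf node
        print(f"node {node}: leaf, class counts {t.value[node]}")
    else:                                                  # internal split node
        name = ['a', 'b'][t.feature[node]]
        print(f"node {node}: split on {name} <= {t.threshold[node]:.3f}")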

Reveille

2 Answers


Since there can be multiple decision boundaries around a sample, I'm going to assume that distance here refers to the distance to the nearest decision boundary.

The solution is a recursive tree traversal algorithm. Note that a decision tree doesn't allow a sample to sit on a boundary the way e.g. an SVM does; every sample in feature space must belong to one of the classes. So here we will keep modifying the sample's features in small steps, and whenever that leads to a region with a different label (than the one originally assigned to the sample by the trained classifier), we assume we've reached the decision boundary.

In detail, like any recursive algorithm, we have two main cases to consider:

  1. Base case, i.e. we're at a leaf node. We simply check whether the current sample has a different label: if so, return it, otherwise return None.
  2. Non-leaf nodes. There are two branches, and we send the sample down both. We don't modify the sample when sending it down the branch it would naturally take, but before sending it down the other branch we look at the node's (feature, threshold) pair and modify that feature of the sample just enough to push it to the opposite side of the threshold.

Complete python code:

def f(node, x, orig_label):
    # dt is the trained classifier and tree = dt.tree_ (both module-level, see below)
    global dt, tree
    if tree.children_left[node] == tree.children_right[node]:  # node is a leaf
        # Base case: return the (possibly modified) sample if it now gets a
        # different label than the original one, otherwise None
        return [x] if dt.predict([x])[0] != orig_label else [None]

    if x[tree.feature[node]] <= tree.threshold[node]:
        # Sample naturally goes left: recurse left unchanged...
        orig = f(tree.children_left[node], x, orig_label)
        # ...and also push it just across the threshold to explore the right branch
        xc = x.copy()
        xc[tree.feature[node]] = tree.threshold[node] + .01
        modif = f(tree.children_right[node], xc, orig_label)
    else:
        # Sample naturally goes right: recurse right unchanged...
        orig = f(tree.children_right[node], x, orig_label)
        # ...and move it onto the threshold (<= goes left) to explore the left branch
        xc = x.copy()
        xc[tree.feature[node]] = tree.threshold[node]
        modif = f(tree.children_left[node], xc, orig_label)
    return [s for s in orig + modif if s is not None]

This returns a list of samples that lead to leaves with a different label. All we need to do now is take the nearest one:

dt = DecisionTreeClassifier(max_depth=2).fit(X, y)
tree = dt.tree_
x = X[0]                                        # the query sample (any sample works here)
res = f(0, x, dt.predict([x])[0])               # 0 is the index of the root node
ans = np.min([np.linalg.norm(x - n) for n in res])  # distance to the nearest boundary

For illustration:

[Image: decision regions with the original sample in blue and the nearest boundary sample in yellow]

Blue is the original sample, yellow is the nearest sample "on" the decision boundary.
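
A rough sketch of how such a plot can be reproduced, reusing the meshgrid code from the question (`x` here is whatever query sample was passed to `f` above):

# Visualize the query sample and the nearest boundary point found by f()
Z = dt.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k')
nearest = min(res, key=lambda n: np.linalg.norm(x - n))  # nearest boundary sample
plt.scatter(x[0], x[1], c='blue', s=80)                  # original sample
plt.scatter(nearest[0], nearest[1], c='yellow', s=80)    # nearest sample on boundary
plt.show()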

Shihab Shahriar Khan
  • Thanks! I'm trying to run it on my example above. Question: `dt` is the trained classifier, right? i.e., `clf` in my code above. And, what about `tree`? I am getting `AttributeError: module 'sklearn.tree' has no attribute 'children_left'`, which makes sense but not sure how to fix it. – Reveille Apr 11 '20 at 15:17
  • my bad, hastily removed some part of the answer after it appeared too big :) yes `dt` is trained classifier, `tree=dt.tree_`: the tree internally used by `dt` – Shihab Shahriar Khan Apr 11 '20 at 16:29
  • Thanks. I don't see where `label` is initialized in the scripts; it throws an error. – Reveille Apr 11 '20 at 18:11
  • sorry for these kinds of edits, but I can't test for some reason. Is it ok now? – Shihab Shahriar Khan Apr 11 '20 at 18:46
  • Yes, it runs now. I think the distance calculation is incorrect though. I replaced it with this `ans = np.min([np.linalg.norm(x-n) for n in res])`. I will test it further. – Reveille Apr 11 '20 at 19:32
  • I tried on different cases and it appears to be working all well so far. Do you mind correcting the distance line? And, for consistency with the question scripts and images, changing the `max_depth=4` to `max_depth=2`. Thanks! – Reveille Apr 12 '20 at 19:15
  • glad to hear that – Shihab Shahriar Khan Apr 12 '20 at 20:59

A decision tree does not learn to draw a decision boundary. It splits at the point of maximum information gain, and for this the decision tree algorithm uses entropy or the Gini index.
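
For reference, the impurity measure is selected in scikit-learn through the `criterion` parameter ('gini' is the default), and the impurity value that was optimized at each node is stored on the fitted tree; a small illustrative sketch using the `clf`, `X`, `y` from the question:

# Choose the split criterion ('gini' is the default, 'entropy' for information gain)
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=2).fit(X, y)
# Per-node impurity values of the fitted tree from the question
print(clf.tree_.impurity)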

For this reason, you cannot find the distance between the points and the decision boundary (there is no explicit decision boundary).

If you want, you can calculate the distance between the points and the lines drawn on the plot, which gives approximate results.
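
As a rough sketch of that approximation (using `clf`, `xx`, `yy`, `Z` from the question, with `X[0]` standing in for an arbitrary query sample): treat every grid point whose predicted label differs from the sample's as lying across the boundary and take the minimum distance, keeping in mind that the grid resolution limits the accuracy.

# Approximate distance to the boundary via the prediction grid from the question
x = X[0]                                        # an example query sample
grid = np.c_[xx.ravel(), yy.ravel()]
other = grid[Z.ravel() != clf.predict([x])[0]]  # grid points with a different label
approx_dist = np.linalg.norm(other - x, axis=1).min()
print(approx_dist)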

Batuhan B
  • Thanks for the helpful info. Yes, but the concept of a decision boundary is still relevant for decision trees (and so is the distance) too, such as Slide 7 [here](http://web.engr.oregonstate.edu/~xfern/classes/cs434/slides/decisiontree-4.pdf) or [here](http://courses.csail.mit.edu/6.034s/handouts/spring12/learn-review-sol.pdf) – Reveille Apr 01 '20 at 13:56
  • Decision boundaries can be drawn after the points are classified. I mean that during the training phase of a decision tree, there is no decision boundary. After the training phase, you can draw or specify the decision boundary. At least that's how I understand it :) – Batuhan B Apr 01 '20 at 14:14
  • I think you could draw decision boundaries after every split step; also, because of the discontinuous nature of decision trees, you can have many disconnected decision boundaries at the same split step – Bernardo stearns reisen Apr 05 '20 at 02:10
  • @Bernardostearnsreisen yes, true. We can. I added the learnt tree structure plot too. I think this contains the inputs we need to calculate the distances, but I have not been able to come up with an algorithm for that yet. – Reveille Apr 05 '20 at 12:44