
I am trying to create a surface plot on an external visualization platform. I'm working with the iris data set featured on the sklearn decision tree documentation page, and I'm using the same approach to create my decision surface plot. My end goal is not the matplotlib visual, though, so from there I feed the data into my visualization software. To do this I just called flatten() and tolist() on xx, yy and Z and wrote a JSON file containing these lists.
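For context, the export step looks roughly like this (a minimal sketch following the sklearn docs example; the feature pair, the padding and the file name are just placeholder choices):

import json
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# rebuild the prediction grid the way the sklearn example does (plot_step = 0.02)
iris = load_iris()
X = iris.data[:, [0, 1]]  # one feature pair, as in the docs example
clf = DecisionTreeClassifier().fit(X, iris.target)
plot_step = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# flatten everything and write it out for the external tool
with open("decision_surface.json", "w") as f:
    json.dump({"xx": xx.flatten().tolist(),
               "yy": yy.flatten().tolist(),
               "Z": Z.flatten().tolist()}, f)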

The trouble is that when I try to plot it, my visualization program crashes because the data is too large. Flattened, each list is over 86,000 entries long. This is because the step size/plot step is very small (0.02), so the grid essentially takes baby steps across the domain between the data's min and max, plotting/filling as it goes according to the model's predictions. It's kind of like a pixel grid; when I shrank the output down to an array of only 2,000 entries I noticed that the coordinates were just lines going back and forth (eventually covering the entire coordinate plane).

Question: Can I retrieve the x,y coordinates of the decision boundary lines themselves (as opposed to iterating across the whole plane)? Ideally a list containing only the turning points of each line. Or, alternatively, is there some completely different way to recreate this plot that is more computationally efficient?

This can somewhat be visualized by replacing the contourf() call with contour():

[image: the same decision surface drawn with contour(), showing only the boundary lines]

I'm just not sure how to retrieve the data governing those lines (via xx, yy and Z or possibly other means?).

Note: I'm not picky about the exact format of the list or data structure that contains the lines, as long as it's computationally efficient. For instance, in the first plot above some red areas are actually islands in the prediction space, so those might have to be handled as their own lines. I'm guessing that as long as the class is coupled with the x,y coordinates, it shouldn't matter how many arrays (containing coordinates) are used to capture the decision boundaries.

Arash Howaida
  • This would be a great feature to add to `sklearn` if you ever figure it out :) – Marijn van Vliet May 12 '17 at 06:27
  • Yea, I hope there is something built-in that can do the trick, but if there isn't I also think it would make a good new feature. I guess now I have a new appreciation for how powerful matplotlib is, since it can handle these huge data structures, however for those of us needing a more condensed data representation for visualization on other platforms, such a feature would be extremely useful. – Arash Howaida May 12 '17 at 07:02

3 Answers


Decision trees do not have very nice boundaries. They have multiple boundaries that hierarchically split the feature space into rectangular regions.

In my implementation of Node Harvest I wrote functions that parse scikit-learn's decision trees and extract the decision regions. For this answer I modified parts of that code to return a list of rectangles that correspond to a tree's decision regions. It should be easy to draw these rectangles with any plotting library. Here is an example using matplotlib:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# decision_areas and plot_areas are defined in the implementation at the end of this answer
n = 100
np.random.seed(42)
x = np.concatenate([np.random.randn(n, 2) + 1, np.random.randn(n, 2) - 1])
y = ['b'] * n + ['r'] * n
plt.scatter(x[:, 0], x[:, 1], c=y)

dtc = DecisionTreeClassifier().fit(x, y)
rectangles = decision_areas(dtc, [-3, 3, -3, 3])
plot_areas(rectangles)
plt.xlim(-3, 3)
plt.ylim(-3, 3)

[image: scatter plot of the two classes with the tree's rectangular decision regions shaded blue and red]

Wherever regions of different color meet there is a decision boundary. I imagine it would be possible with moderate effort to extract just these boundary lines but I'll leave that to anyone who is interested.

rectangles is a numpy array. Each row corresponds to one rectangle and the columns are [left, right, bottom, top, class].
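For anyone who does want the boundary lines themselves rather than the filled rectangles, here is a rough, untested sketch of one way to extract the shared edges between rectangles of different classes (the function name, segment format and tolerance are my own choices):

def boundary_segments(rectangles, tol=1e-9):
    """Return (x0, y0, x1, y1) segments where rectangles of different classes touch."""
    segments = []
    for i in range(len(rectangles)):
        for j in range(i + 1, len(rectangles)):
            a, b = rectangles[i], rectangles[j]
            if int(a[4]) == int(b[4]):
                continue  # same class on both sides, so no decision boundary here
            for p, q in ((a, b), (b, a)):
                # vertical edge: p's right side coincides with q's left side
                if abs(p[1] - q[0]) < tol:
                    lo, hi = max(p[2], q[2]), min(p[3], q[3])
                    if hi > lo:
                        segments.append((float(p[1]), float(lo), float(p[1]), float(hi)))
                # horizontal edge: p's top side coincides with q's bottom side
                if abs(p[3] - q[2]) < tol:
                    lo, hi = max(p[0], q[0]), min(p[1], q[1])
                    if hi > lo:
                        segments.append((float(lo), float(p[3]), float(hi), float(p[3])))
    return segments

The resulting list of segments is small and, since the tuples contain plain Python floats, straightforward to dump to JSON.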


Update: Application to the Iris data set

The Iris data set contains three classes instead of two, as in the example above. So we have to add another color to the plot_areas function: color = ['b', 'r', 'g'][int(rect[4])] (the adjusted function is spelled out below). Furthermore, the data set is 4-dimensional (it contains four features), but we can only plot two features in 2D. We need to choose which features to plot and tell the decision_areas function. It takes two arguments, x and y, which are the features that go on the x and y axis, respectively. The default is x=0, y=1, which works with any data set that has more than one feature. However, in the Iris data set the first dimension is not very interesting, so we will use a different setting.
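Spelled out, the adjusted function looks like this (identical to the implementation at the end of the answer except for the extra color):

def plot_areas(rectangles):
    for rect in rectangles:
        # third color added for the third Iris class
        color = ['b', 'r', 'g'][int(rect[4])]
        rp = Rectangle([rect[0], rect[2]],
                       rect[1] - rect[0],
                       rect[3] - rect[2], color=color, alpha=0.3)
        plt.gca().add_artist(rp)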

The decision_areas function also does not know the extent of the data set. Decision trees often have open decision ranges that extend toward infinity (e.g. whenever sepal length is less than xyz, it's class B). In this case we need to artificially narrow down the range for plotting. I chose -3..3 for the example data set, but for the iris data set other ranges are appropriate (there are never negative values, and some features extend beyond 3).

Here we plot the decision regions over the last two features in a range of 0..7 and 0..5:

from sklearn.datasets import load_iris
data = load_iris()
x = data.data
y = data.target
dtc = DecisionTreeClassifier().fit(x, y)
rectangles = decision_areas(dtc, [0, 7, 0, 5], x=2, y=3)
plt.scatter(x[:, 2], x[:, 3], c=y)
plot_areas(rectangles)

[image: Iris decision regions over the last two features, with red and green areas overlapping in the top left]

Note how there is a weird overlap of the red and green areas in the top left. This happens because the tree makes decisions in four dimensions but we can show only two. There is no really clean way around this. A high-dimensional classifier often has no nice decision boundaries in low-dimensional space.

So if you are more interested in the classifier, that is what you get. You can generate different views along various combinations of dimensions, but there are limits to the usefulness of the representation.

However, if you are more interested in the data than in the classifier you can restrict the dimensionality before fitting. In that case the classifier only makes decisions in the 2-dimensional space and we can plot nice decision regions:

from sklearn.datasets import load_iris
data = load_iris()
x = data.data[:, [2, 3]]
y = data.target
dtc = DecisionTreeClassifier().fit(x, y)
rectangles = decision_areas(dtc, [0, 7, 0, 3], x=0, y=1)
plt.scatter(x[:, 0], x[:, 1], c=y)
plot_areas(rectangles)

[image: Iris decision regions after fitting on only the last two features, with clean non-overlapping rectangles]


Finally, here is the implementation:

import numpy as np
from collections import deque
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import _tree as ctree
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle


class AABB:
    """Axis-aligned bounding box"""
    def __init__(self, n_features):
        self.limits = np.array([[-np.inf, np.inf]] * n_features)

    def split(self, f, v):
        left = AABB(self.limits.shape[0])
        right = AABB(self.limits.shape[0])
        left.limits = self.limits.copy()
        right.limits = self.limits.copy()

        left.limits[f, 1] = v
        right.limits[f, 0] = v

        return left, right


def tree_bounds(tree, n_features=None):
    """Compute final decision rule for each node in tree"""
    if n_features is None:
        n_features = np.max(tree.feature) + 1
    aabbs = [AABB(n_features) for _ in range(tree.node_count)]
    queue = deque([0])
    while queue:
        i = queue.pop()
        l = tree.children_left[i]
        r = tree.children_right[i]
        if l != ctree.TREE_LEAF:
            aabbs[l], aabbs[r] = aabbs[i].split(tree.feature[i], tree.threshold[i])
            queue.extend([l, r])
    return aabbs


def decision_areas(tree_classifier, maxrange, x=0, y=1, n_features=None):
    """ Extract decision areas.

    tree_classifier: Instance of a sklearn.tree.DecisionTreeClassifier
    maxrange: values to insert for [left, right, bottom, top] (i.e. [xmin, xmax, ymin, ymax]) if the interval is open (+/-inf)
    x: index of the feature that goes on the x axis
    y: index of the feature that goes on the y axis
    n_features: override autodetection of number of features
    """
    tree = tree_classifier.tree_
    aabbs = tree_bounds(tree, n_features)

    rectangles = []
    for i in range(len(aabbs)):
        if tree.children_left[i] != ctree.TREE_LEAF:
            continue
        l = aabbs[i].limits
        r = [l[x, 0], l[x, 1], l[y, 0], l[y, 1], np.argmax(tree.value[i])]
        rectangles.append(r)
    rectangles = np.array(rectangles)
    rectangles[:, [0, 2]] = np.maximum(rectangles[:, [0, 2]], maxrange[0::2])
    rectangles[:, [1, 3]] = np.minimum(rectangles[:, [1, 3]], maxrange[1::2])
    return rectangles

def plot_areas(rectangles):
    for rect in rectangles:
        # rect is [left, right, bottom, top, class]; the class picks the color
        color = ['b', 'r'][int(rect[4])]
        rp = Rectangle([rect[0], rect[2]],
                       rect[1] - rect[0],
                       rect[3] - rect[2], color=color, alpha=0.3)
        plt.gca().add_artist(rp)
MB-F
  • great example, and custom functions! So if I wanted to export the rectangle x,y coordinates to json, `rectangles` would be a nested array with all the rectangles for the plot(s), right? – Arash Howaida May 13 '17 at 00:31
  • @ArashHowaida I'm not sure if you can convert a numpy array directly to json, but you can call `rectangles.tolist()` to get nested lists that Python's json tools export to nested json arrays. – MB-F May 13 '17 at 14:42
  • I sometimes get negative values for my x/y coordinates even when the domain is strictly positive. Could that be due to entering a `maxrange` that is not exact enough or is it other user error? – Arash Howaida May 13 '17 at 19:08
  • I just noticed in your original random data example there was in fact a fifth entry for the predicted class. I guess I messed something up when I adapted it for the iris data set. For robustness, could you include a short set up to demonstrate how the iris data interfaces with your functions? maybe just the first pair: `x = iris.data[:,[0,1]]` – Arash Howaida May 13 '17 at 19:16
  • @ArashHowaida Yes, the rectangles contain five columns - four coordinates and the predicted class (for choosing the color). I've added examples with the iris data set. – MB-F May 15 '17 at 07:30
  • Only issue I have up until now is whenever I try adjusting the `max_depth` to `2`, `4` or anything other than `3`, I get an error from your function at the following line: `r = [l[x, 0], l[x, 1], l[y, 0], l[y, 1], np.argmax(tree.value[i])]`. It says index 1 out of bounds for axis 0 size 1. What should I change to make it robust to any `max_depth`? – Arash Howaida May 23 '17 at 12:56
  • @ArashHowaida That can happen when a feature you plot was not used at all by the tree. I have updated the `decision_areas` function to accept a `n_features` parameter. Set it to the number of features in the data set (i.e. 4 with *Iris*) and you should be fine. – MB-F May 23 '17 at 13:16
  • @kazemakase I found your answer very helpful. Do you think it can be used to find the distances in n dimensions too? I posted that [question](https://stackoverflow.com/questions/60960692/find-distance-to-decision-boundary-in-decision-trees) a few days ago, but no correct answer yet. – Reveille Apr 06 '20 at 21:49

@kazemakase's approach is the "right" one. For completeness' sake, here is a simple way to get every "pixel" in Z that is a decision boundary:

import numpy as np

# a grid cell is on the boundary if the predicted class changes w.r.t. a neighbour
steps = np.diff(Z, axis=0)[:, 1:] + np.diff(Z, axis=1)[1:, :]
is_boundary = steps != 0
# np.where returns (row, column) indices; rows index yy, columns index xx
iy, ix = np.where(is_boundary)
# rescale to convert pixels into original units (and add the grid origin back in)
x = ix.astype(float) * plot_step + xx.min()
y = iy.astype(float) * plot_step + yy.min()
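To quickly eyeball the mask without an external tool, something like this works (a minimal sketch assuming matplotlib; the figure below was additionally dilated so that single pixels are visible):

import matplotlib.pyplot as plt

plt.imshow(is_boundary, origin='lower', cmap='gray')
plt.show()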

Plot of is_boundary (dilated so one can see all non-zero entries):

[image: binary mask of the boundary pixels (dilated so single pixels are visible)]

Paul Brodersen
  • I also noticed that `is_boundary` is a bool list, and I'm not sure how matplotlib plots that. Perhaps because in python, there are no variables, just data and name links? Anyway my external visualization platform will not work this way. And when I ran the following to test out the coordinates before exporting to json: `plt.plot(x,y) plt.show()`, the result was not like the original. Could you try that out as well so we can figure it out? Thank you. – Arash Howaida May 13 '17 at 16:16
  • x,y are pixel locations that mark a boundary. They are not line segments (as they are not in order). If you wanted to plot line segments with `plt.plot`, you would have to extract them from the "image" `is_boundary`. Otherwise, you could just plot `is_boundary` with `plt.imshow`. – Paul Brodersen May 15 '17 at 13:47

For those interested, I recently also had to implement this for higher-dimensional data; the code was as follows:

import numpy as np

# leaves are marked with -1 in children_left / children_right
number_of_leaves = (tree.tree_.children_left == -1).sum()
features = x.shape[1]
boundaries = np.zeros([number_of_leaves, features, 2])
boundaries[:, :, 0] = -np.inf
boundaries[:, :, 1] = np.inf

locs = np.where(tree.tree_.children_left == -1)[0]

for k in range(locs.shape[0]):
    idx = locs[k]
    idx_new = idx

    # walk from the leaf back up to the root, tightening the bounds
    while idx_new != 0:
        i_check = np.where(tree.tree_.children_left == idx_new)[0]
        j_check = np.where(tree.tree_.children_right == idx_new)[0]

        if i_check.shape[0] == 1:
            # came from the left child: feature <= threshold (upper bound)
            idx_new = i_check[0]
            feat_ = tree.tree_.feature[idx_new]
            val_ = tree.tree_.threshold[idx_new]
            boundaries[k, feat_, 1] = min(boundaries[k, feat_, 1], val_)
        elif j_check.shape[0] == 1:
            # came from the right child: feature > threshold (lower bound)
            idx_new = j_check[0]
            feat_ = tree.tree_.feature[idx_new]
            val_ = tree.tree_.threshold[idx_new]
            boundaries[k, feat_, 0] = max(boundaries[k, feat_, 0], val_)
        else:
            print('Fail Case')  # for debugging only - never occurs

Essentially I build up an n*d*2 tensor, where n is the number of leaves of the tree, d is the dimensionality of the feature space, and the third dimension holds the min and max values. Leaves are marked with -1 in tree.tree_.children_left / tree.tree_.children_right; for each leaf I then walk back up towards the root to find the splits that led to it and add each splitting threshold to the decision bounds.
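If only two of the dimensions are needed (e.g. for a 2D export or plot), a small sketch like the following turns the tensor into plottable rectangles; the feature indices (2, 3) and the clipping ranges are just assumptions for the Iris case:

# slice out two features and clip the open (+/- inf) intervals for plotting
lo = np.array([0.0, 0.0])  # assumed lower plot limits for features 2 and 3
hi = np.array([7.0, 3.0])  # assumed upper plot limits for features 2 and 3
rects = boundaries[:, [2, 3], :].copy()
rects[:, :, 0] = np.maximum(rects[:, :, 0], lo)
rects[:, :, 1] = np.minimum(rects[:, :, 1], hi)
# rects[k] is now [[xmin, xmax], [ymin, ymax]] for leaf k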

j__