
I want to generate code (Python for now, but ultimately C) from a trained gradient boosted classifier (from sklearn). As far as I understand it, the model takes an initial predictor and then adds the predictions of sequentially trained regression trees, each scaled by the learning rate. The chosen class is then the class with the highest output value.
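
In other words, the raw score for class k is F0[k] plus, for every boosting stage, the learning rate times that stage's regression-tree output for class k. A minimal sketch of that idea (the names f0, tree_preds and learning_rate here are just illustrative, not sklearn's API):

import numpy as np

def gbm_predict_class(f0, tree_preds, learning_rate):
    # f0: initial per-class scores, shape (n_classes,)
    # tree_preds: one array of per-class tree outputs for each boosting stage
    scores = np.array(f0, dtype=float)
    for stage_outputs in tree_preds:
        scores += learning_rate * np.asarray(stage_outputs)
    # the predicted class is the one with the highest accumulated score
    return np.argmax(scores)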

This is the code I have so far:

import numpy as np

def recursep_gbm(left, right, threshold, features, node, depth, value, out_name, scale):
    # Functions for spacing
    tabs = lambda n: (' ' * n * 4)[:-1]
    def print_depth():
        if depth: print tabs(depth),
    def print_depth_b():
        if depth: 
            print tabs(depth), 
            if (depth-1): print tabs(depth-1),

    if (threshold[node] != -2):
        print_depth()
        print "if " + features[node] + " <= " + str(threshold[node]) + ":"
        if left[node] != -1:
            recursep_gbm(left, right, threshold, features, left[node], depth+1, value, out_name, scale)
        print_depth()
        print "else:"
        if right[node] != -1:
            recursep_gbm(left, right, threshold, features, right[node], depth+1, value, out_name, scale)
    else:
        # This is an end node, add results
        print_depth()
        print out_name + " += " + str(scale) + " * " + str(value[node][0, 0])

def print_GBM_python(gbm_model, feature_names, X_data, l_rate):
    print "PYTHON CODE"

    # Get trees
    trees = gbm_model.estimators_

    # F0
    f0_probs = np.mean(gbm_model.predict_log_proba(X_data), axis=0)
    probs    = ", ".join([str(prob) for prob in f0_probs])
    print "# Initial probabilities (F0)"
    print "scores = np.array([%s])" % probs
    print 

    print "# Update scores for each estimator"
    for j, tree_group in enumerate(trees):
        for k, tree in enumerate(tree_group):
            left      = tree.tree_.children_left
            right     = tree.tree_.children_right
            threshold = tree.tree_.threshold
            features  = [feature_names[i] for i in tree.tree_.feature]
            value = tree.tree_.value

            recursep_gbm(left, right, threshold, features, 0, 0, value, "scores[%i]" % k, l_rate)
        print

    print "# Get class with max score"
    print "return np.argmax(scores)"

I modified the tree generating code from this question.
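
For completeness, this is roughly how I call it (the data below is made up purely to illustrate the call; any fitted GradientBoostingClassifier works):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Made-up data, only to obtain a fitted model for the example
X_data = np.random.rand(100, 3)
y_data = np.random.randint(0, 3, size=100)
feature_names = ["X0", "X1", "X2"]

gbm_model = GradientBoostingClassifier(n_estimators=2, max_depth=1, learning_rate=0.1)
gbm_model.fit(X_data, y_data)

print_GBM_python(gbm_model, feature_names, X_data, gbm_model.learning_rate)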

This is an example of what it generates (with 3 classes, 2 estimators, a max depth of 1 and a learning rate of 0.1):

# Initial probabilities (F0)
scores = np.array([-0.964890, -1.238279, -1.170222])

# Update scores for each estimator
if X1 <= 57.5:
    scores[0] += 0.1 * 1.60943587225
else:
    scores[0] += 0.1 * -0.908433703247
if X2 <= 0.000394500006223:
    scores[1] += 0.1 * -0.900203054177
else:
    scores[1] += 0.1 * 0.221484425933
if X2 <= 0.0340005010366:
    scores[2] += 0.1 * -0.848148803219
else:
    scores[2] += 0.1 * 1.98100820717

if X1 <= 57.5:
    scores[0] += 0.1 * 1.38506104792
else:
    scores[0] += 0.1 * -0.855930587354
if X1 <= 43.5:
    scores[1] += 0.1 * -0.810729087535
else:
    scores[1] += 0.1 * 0.237980820334
if X2 <= 0.027434501797:
    scores[2] += 0.1 * -0.815242297324
else:
    scores[2] += 0.1 * 1.69970863021

# Get class with max score
return np.argmax(scores)

I used the log probability as F0, based on this.

For one estimator it gives me the same predictions as the predict method on the trained model. However, when I add more estimators the predictions start to deviate. Am I supposed to incorporate the step length (described here)? Also, is my F0 correct? Should I be taking the mean? And should I convert the log-probabilities to something else? Any help is greatly appreciated!
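
For reference, the comparison above is against the predict method; something like the following (using the gbm_model and X_data names from above) also lets me compare the raw scores stage by stage, since decision_function and staged_decision_function return the per-class scores that the generated code is meant to reproduce:

# Final class predictions (what I currently compare against)
model_preds = gbm_model.predict(X_data)

# Raw per-class scores -- these should match the generated "scores" array
model_scores = gbm_model.decision_function(X_data)

# Per-stage raw scores, to see at which estimator the results start to diverge
for i, staged in enumerate(gbm_model.staged_decision_function(X_data)):
    print("stage %d: %s" % (i, staged[0]))  # scores for the first sample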

  • Have you read about model [persistence](http://scikit-learn.org/stable/modules/model_persistence.html)? Visualizing a gradient boosting model is more complex than interpreting individual decision trees. Feature importance is a common technique when [interpreting](http://scikit-learn.org/stable/modules/ensemble.html#interpretation) and visualizing the model. –  May 28 '16 at 19:51
  • The ultimate goal is to have the model running in C, hence wanting the code generation. As far as I can tell, model persistence only allows for saving the model to be run again in Python? – Pokey McPokerson May 30 '16 at 07:17

1 Answer


Under the hood of a Gradient Boosting classifier is a sum of regression trees.

You can get the weak-learner decision trees from the trained classifier by reading the estimators_ attribute. According to the documentation, it is in fact an ndarray of DecisionTreeRegressor.

Finally, to fully reproduce the predict function, you also need to access the weights, as described in this answer.
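
As a rough sketch of what that reproduction looks like (the initial scores are taken from the fitted init_ estimator, whose exact interface varies between scikit-learn versions, so treat this as an outline rather than the library's exact implementation):

import numpy as np

def manual_decision_function(gbm, X):
    # Initial raw scores (F0); older scikit-learn versions expose them
    # through the fitted init_ estimator's predict method.
    scores = gbm.init_.predict(X).astype(np.float64)

    # estimators_ has shape (n_stages, n_classes): one regression tree
    # per class and per boosting stage (multi-class case assumed here).
    for stage in gbm.estimators_:
        for k, tree in enumerate(stage):
            scores[:, k] += gbm.learning_rate * tree.predict(X)
    return scores

# Sanity check against the library's own raw scores:
# np.allclose(manual_decision_function(gbm_model, X_data), gbm_model.decision_function(X_data))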

Alternatively, you could export the GraphViz representation of a decision tree (instead of its pseudocode). Find a visual example below, from scikit-learn.org:

[Image: GraphViz export of a decision tree, from scikit-learn.org]
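
The export itself can be done with sklearn.tree.export_graphviz on any single tree taken from estimators_; a minimal sketch (assuming the gbm_model and feature_names from your question, with an arbitrary output file name):

from sklearn.tree import export_graphviz

# Export the stage-0 tree for class 0 as a .dot file, which GraphViz
# can then render, e.g. with: dot -Tpng tree_0_0.dot -o tree_0_0.png
tree = gbm_model.estimators_[0][0]
export_graphviz(tree, out_file="tree_0_0.dot", feature_names=feature_names)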

As a final, marginal note/suggestion, you might also want to try xgboost: among other features, it has built-in "dump model" functionality that writes all the decision trees of a trained model to a text file.
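
For example (a minimal sketch; the data and parameter values below are placeholders, reusing the X_data/y_data arrays from your question):

import xgboost as xgb

# Placeholder training call, only to obtain a fitted booster
dtrain = xgb.DMatrix(X_data, label=y_data)
params = {"objective": "multi:softmax", "num_class": 3, "max_depth": 1, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=2)

# Writes every tree of the trained model to a readable text file
bst.dump_model("model_dump.txt")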

  • Thanks for the answer! The function I provided was meant to take the `estimators_` as an input (I have since edited it to take the model instead, for clarity). That answer seems to suggest that the tree weights are baked into the trees themselves? The only extra weight is the learning rate, which I already added. – Pokey McPokerson May 30 '16 at 07:23
  • I would also check whether the graphical representation of each tree matches its generated code: is any "nested if" missing? –  May 30 '16 at 10:58
  • The answer you copied from uses parentheses to manage the "nested if"s. I doubt that you can omit them even though your max depth is 1... –  May 30 '16 at 11:17
  • The other answer's code is C, hence the parentheses. Nested ifs are handled by the indents in my version (for Python). – Pokey McPokerson Aug 16 '16 at 08:34