143

I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, which attributes/dates contribute to the result to what extent. Therefore I am just using the feature_importances_, which works well for me.

However, I would like to know how they are getting calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on this topic.

luator
  • 4,769
  • 3
  • 30
  • 51
user2244670
  • 1,431
  • 2
  • 10
  • 3

7 Answers7

176

There are indeed several ways to get feature "importances". As often, there is no strict consensus about what this word means.

In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.

In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.

(Note that both algorithms are available in the randomForest R package.)

[1]: Breiman, Friedman, "Classification and regression trees", 1984.

Gilles Louppe
  • 2,694
  • 1
  • 12
  • 8
  • 52
    It could be great if this answer was mentioned in the documentation of the importance attributes/example. Been searching for it for awhile too :) – d1337 Jul 04 '13 at 01:48
  • 3
    It seems the importance score is in relative value? For example, the sum of the importance scores of all features is always 1 (see the example here http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py) – RNA Apr 26 '14 at 07:45
  • 6
    @RNA: Yes, by default variable importances are normalized in scikit-learn, such that they sum to one. You can circumvent this by looping over the individual base estimators and calling `tree_.compute_feature_importances(normalize=False)`. – Gilles Louppe Jul 31 '14 at 07:13
  • 3
    @GillesLouppe Do you use the out of bag samples to measure the reduction in MSE for a forest of decision tree regressors in each tree? Or all training data used on the tree? – Cokes Jan 20 '15 at 20:27
  • 1
    Two useful resources. (1) http://blog.datadive.net/selecting-good-features-part-iii-random-forests/ a blog by Ando Saabas implements both "mean decrease impurity" and also "mean decrease in accuracy" as mentioned by Gilles. (2) Download and read Gilles Louppe's thesis. – Mark Teese Jul 27 '18 at 09:26
  • "often cited, but unfortunately rarely read": it's not open access anywhere on the internet – Martino Mar 12 '19 at 16:33
  • FYI, there is a pretty good document in version 0.22 of sklearn on [Permutation Importance vs Random Forest Feature Importance (MDI)](https://scikit-learn.org/dev/auto_examples/inspection/plot_permutation_importance.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-py) – jsga Oct 15 '19 at 10:10
  • "reaching that node". which node? – Salih Jan 10 '23 at 18:06
  • @Cokes, based on https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html Aug. 2023 the reduction in metric is measured using training statistics: "impurity-based importances are computed on training set statistics and therefore do not reflect the ability of feature to be useful to make predictions that generalize to the test set". Htis is a major problem in my eyes. – Ggjj11 Aug 31 '23 at 07:55
58

The usual way to compute the feature importance values of a single tree is as follows:

  1. you initialize an array feature_importances of all zeros with size n_features.

  2. you traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].

The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy, MSE, ...). Its the impurity of the set of examples that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.

Its important that these values are relative to a specific dataset (both error reduction and the number of samples are dataset specific) thus these values cannot be compared between different datasets.

As far as I know there are alternative ways to compute feature importance values in decision trees. A brief description of the above method can be found in "Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

smci
  • 32,567
  • 20
  • 113
  • 146
Peter Prettenhofer
  • 1,951
  • 18
  • 23
16

It's the ratio between the number of samples routed to a decision node involving that feature in any of the trees of the ensemble over the total number of samples in the training set.

Features that are involved in the top level nodes of the decision trees tend to see more samples hence are likely to have more importance.

Edit: this description is only partially correct: Gilles and Peter's answers are the correct answer.

ogrisel
  • 39,309
  • 12
  • 116
  • 125
  • 1
    Do you know if there is some paper/documentation about the exact method? eg. Breiman, 2001. It would be great if I had some proper document, which I could cite for the methodology. – user2244670 Apr 04 '13 at 12:27
  • 1
    @ogrisel it would be great if you could clearly mark your response as the explanation for the "weighting". The weighting alone does not determine the feature importance. The "impurity metric" ("gini-importance" or RSS) combined with the weights, averaged over trees determines the overall feature importance. Unfortunately the documentation on scikit-learn here: http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation is not accurate and incorrectly mentions "depth" as the impurity metric. – Ariel Sep 06 '17 at 03:24
14

As @GillesLouppe pointed out above, scikit-learn currently implements the "mean decrease impurity" metric for feature importances. I personally find the second metric a bit more interesting, where you randomly permute the values for each of your features one-by-one and see how much worse your out-of-bag performance is.

Since what you're after with feature importance is how much each feature contributes to your overall model's predictive performance, the second metric actually gives you a direct measure of this, whereas the "mean decrease impurity" is just a good proxy.

If you're interested, I wrote a small package that implements the Permutation Importance metric and can be used to calculate the values from an instance of a scikit-learn random forest class:

https://github.com/pjh2011/rf_perm_feat_import

Edit: This works for Python 2.7, not 3

Peter
  • 373
  • 3
  • 7
  • Hi @Peter when I use your code I get this error: NameError: name 'xrange' is not defined. – Aizzaac Jan 06 '17 at 23:50
  • Hi @Aizzaac. Sorry I'm new to writing packages, so I should've noted I wrote it for Python 2.7. Try def xrange(x): return iter(range(x)) before running it – Peter Jan 07 '17 at 07:09
4

code:

iris = datasets.load_iris()  
X = iris.data  
y = iris.target  
clf = DecisionTreeClassifier()  
clf.fit(X, y)  

decision_tree plot:
enter image description here
We get

compute_feature_importance:[0. ,0.01333333,0.06405596,0.92261071]   

Check source code:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count

    cdef double normalizer = 0.

    cdef np.ndarray[np.float64_t, ndim=1] importances
    importances = np.zeros((self.n_features,))
    cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

    with nogil:
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]

                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1

    importances /= nodes[0].weighted_n_node_samples

    if normalize:
        normalizer = np.sum(importances)

        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            importances /= normalizer

    return importances

Try calculate the feature importance:

print("sepal length (cm)",0)
print("sepal width (cm)",(3*0.444-(0+0)))
print("petal length (cm)",(54* 0.168 - (48*0.041+6*0.444)) +(46*0.043 -(0+3*0.444)) + (3*0.444-(0+0)))
print("petal width (cm)",(150* 0.667 - (0+100*0.5)) +(100*0.5-(54*0.168+46*0.043))+(6*0.444 -(0+3*0.444)) + (48*0.041-(0+0)))

We get feature_importance: np.array([0,1.332,6.418,92.30]).

After normalized, we get array ([0., 0.01331334, 0.06414793, 0.92253873]),this is same as clf.feature_importances_.

Be careful all classes are supposed to have weight one.

desertnaut
  • 57,590
  • 26
  • 140
  • 166
tengfei li
  • 41
  • 2
3

For those looking for a reference to the scikit-learn's documentation on this topic or a reference to the answer by @GillesLouppe:

In RandomForestClassifier, estimators_ attribute is a list of DecisionTreeClassifier (as mentioned in the documentation). In order to compute the feature_importances_ for the RandomForestClassifier, in scikit-learn's source code, it averages over all estimator's (all DecisionTreeClassifer's) feature_importances_ attributes in the ensemble.

In DecisionTreeClassifer's documentation, it is mentioned that "The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [1]."

Here is a direct link for more info on variable and Gini importance, as provided by scikit-learn's reference below.

[1] L. Breiman, and A. Cutler, “Random Forests”, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Glorfindel
  • 21,988
  • 13
  • 81
  • 109
Makan
  • 547
  • 6
  • 8
0

Feature Importance in Random Forest

  • Random forest uses many trees, and thus, the variance is reduced
  • Random forest allows far more exploration of feature combinations as well
  • Decision trees gives Variable Importance and it is more if there is reduction in impurity (reduction in Gini impurity)
  • Each tree has a different Order of Importance

    Here is what happens in the background!
  • We take an attribute and check in all the trees where it is present and take the average values of the change in the homogeneity on this attribute split. This average value of change in the homogeneity gives us the feature importance of the attribute enter image description here
blackhole
  • 214
  • 1
  • 11