
How can I assign the top 20-30 features (by importance) to a variable, based on the values below?

# decision tree for feature importance on a regression problem
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from matplotlib import pyplot

# define the model
model = DecisionTreeRegressor()
# fit the model
model.fit(X_train_total, y_train)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))

Output of the Code:

Feature: 0, Score: 0.19648
Feature: 1, Score: 0.00085
Feature: 2, Score: 0.00378
Feature: 3, Score: 0.00000
Feature: 4, Score: 0.00083
Feature: 5, Score: 0.00165
Feature: 6, Score: 0.00015
Feature: 7, Score: 0.00026
Feature: 8, Score: 0.00596
Feature: 9, Score: 0.00868
Feature: 10, Score: 0.00017
Feature: 11, Score: 0.00557
Feature: 12, Score: 0.00674
Feature: 13, Score: 0.00269
Feature: 14, Score: 0.01063
Feature: 15, Score: 0.00011
Feature: 16, Score: 0.01006
Feature: 17, Score: 0.00232
Feature: 18, Score: 0.00000
Feature: 19, Score: 0.01514
Feature: 20, Score: 0.00233
Feature: 21, Score: 0.00784
Feature: 22, Score: 0.04224
Feature: 23, Score: 0.00963
Feature: 24, Score: 0.04597
Feature: 25, Score: 0.00001
Feature: 26, Score: 0.00056
Feature: 27, Score: 0.00943
Feature: 28, Score: 0.00596
Feature: 29, Score: 0.00479
Feature: 30, Score: 0.00086
Feature: 31, Score: 0.00000
Feature: 32, Score: 0.00058
Feature: 33, Score: 0.00000
Feature: 34, Score: 0.00001
Feature: 35, Score: 0.00615
Feature: 36, Score: 0.00253
Feature: 37, Score: 0.00000
Feature: 38, Score: 0.00000
Feature: 39, Score: 0.00000
Feature: 40, Score: 0.00180
Feature: 41, Score: 0.00071
Feature: 42, Score: 0.00000
Feature: 43, Score: 0.00003
Feature: 44, Score: 0.00000
Feature: 45, Score: 0.00000
Feature: 46, Score: 0.00066
Feature: 47, Score: 0.00119
Feature: 48, Score: 0.00000
Feature: 49, Score: 0.00107
Feature: 50, Score: 0.00019
Feature: 51, Score: 0.00000
Feature: 52, Score: 0.00005
Feature: 53, Score: 0.00058
Feature: 54, Score: 0.00020
Feature: 55, Score: 0.00272
Feature: 56, Score: 0.00000
Feature: 57, Score: 0.00001
Feature: 58, Score: 0.00000
Feature: 59, Score: 0.00105
Feature: 60, Score: 0.01533
Feature: 61, Score: 0.00266

1 Answer

First, let's make your example reproducible by adding data to it. I will use a standard regression dataset, Boston house prices.

from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

# get some data
X_train_total, y_train = load_boston(return_X_y=True)

# define the model
model = DecisionTreeRegressor()
# fit the model
model.fit(X_train_total, y_train)
# get importance
importance = model.feature_importances_
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
Feature: 0, Score: 0.03906
Feature: 1, Score: 0.00097
Feature: 2, Score: 0.00140
Feature: 3, Score: 0.00076
Feature: 4, Score: 0.06255
Feature: 5, Score: 0.57919
Feature: 6, Score: 0.01017
Feature: 7, Score: 0.07265
Feature: 8, Score: 0.00164
Feature: 9, Score: 0.01394
Feature: 10, Score: 0.00690
Feature: 11, Score: 0.00606
Feature: 12, Score: 0.20470

If you inspect importance you will notice that it's just a NumPy array of importances. So, the only thing we need is to get the indices of the N maximum values in a NumPy array.

I like this particular way:

n_top_features = 5
top_features = importance.argsort()[-n_top_features:]
print(top_features)  # [ 0  4  7 12  5]

Notice that these are in ascending order of importance (least important last in reverse, i.e. least important first); use np.flip(top_features) to order them from most to least important.
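The same pattern covers the top 20-30 features asked about in the question; a minimal sketch with a toy array standing in for the real model.feature_importances_:

```python
import numpy as np

# toy scores standing in for model.feature_importances_
importance = np.array([0.1, 0.05, 0.3, 0.0, 0.25, 0.2, 0.1])

n_top_features = 3  # set to 20 or 30 for the real data
# indices of the N largest values; np.flip puts the most important first
top_features = np.flip(np.argsort(importance)[-n_top_features:])
print(top_features)  # [2 4 5]
```

Ties are broken by argsort's stable ordering, so for equal scores the lower index comes first in the ascending sort.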


Now, if you want to retrain using only those top features, you can do:

X_train_top_feat = X_train_total[:, top_features]
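Putting it together, here is a self-contained sketch of refitting on the reduced matrix (using make_regression as stand-in data, since the asker's X_train_total is not available):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for X_train_total / y_train (62 features, like the question)
X_train_total, y_train = make_regression(n_samples=200, n_features=62, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(X_train_total, y_train)

# indices of the 20 most important features
n_top_features = 20
top_features = model.feature_importances_.argsort()[-n_top_features:]

# keep only those columns and refit on the reduced matrix
X_train_top_feat = X_train_total[:, top_features]
model_top = DecisionTreeRegressor(random_state=0).fit(X_train_top_feat, y_train)
print(X_train_top_feat.shape)  # (200, 20)
```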

Warning: don't use pyplot.bar directly on the importance array; pass a copy (importance.copy()) instead. pyplot.bar can sort the array under the hood, and because NumPy arrays are mutable, you'd then be left with a sorted array in your scope that you never sorted yourself. Bugs like this can take a long time to track down.

  • I am receiving the following error for this code -- X_train_top_feat = X_train_total[:, top_features] -- how do I resolve it? TypeError: '(slice(None, None, None), array([ 16, 41, 133, 203, 102, 111, 220, 204, 231, 26, 154, 200, 31, 221, 12, 11, 6, 223, 1, 13, 27, 115, 9, 164, 4, 72, 180, 167, 20, 28, 19, 151, 8, 144, 78, 264, 22, 171, 0, 159]))' is an invalid key – Dataleon Aug 24 '21 at 22:33
  • I've just re-ran my code, and it works. Judging by your indices you're applying it to **your** dataset, which you have not provided, so there is no way to help you with that. Can you first make sure that the answer I provided runs in your environment without errors? – Ufos Aug 25 '21 at 12:23
  • @Dataleon That means, the problem is in the way your `X_train_total` differs from mine. At this point you will have to either provide us with your data or isolate the issue, and ask a separate question. I'd start by checking `type(X_train_total)` and `X_train_total.shape`. – Ufos Aug 25 '21 at 13:28
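The TypeError in the comments is what pandas raises when a DataFrame is sliced NumPy-style with `[:, indices]`; if X_train_total is a DataFrame rather than a NumPy array, .iloc does the positional column selection instead. A sketch with a toy frame, since the asker's data was not provided:

```python
import numpy as np
import pandas as pd

# toy DataFrame standing in for the asker's X_train_total
X_train_total = pd.DataFrame(np.arange(20).reshape(4, 5),
                             columns=[f"f{i}" for i in range(5)])
top_features = np.array([2, 4])

# X_train_total[:, top_features] raises TypeError on a DataFrame;
# .iloc selects rows/columns by integer position instead
X_train_top_feat = X_train_total.iloc[:, top_features]
print(X_train_top_feat.shape)  # (4, 2)
```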