
I'm currently working on a use case using RandomForestRegressor. To get training and test data separately based on one column, let's say Home, the dataframe was split into a dictionary. I'm almost done with the modelling, but stuck on getting the feature importances for each of the keys in the dictionary (number of keys = 21). Please have a look at the code below:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

hp = pd.get_dummies(hp)
hp = {i: g for i, g in hp.set_index(["Home"]).groupby(level = [0])}

feature = {}; feature_train = {}; feature_test = {}
target = {}; target_train = {}; target_test = {}; target_pred = {}
importances = {}

for k, v in hp.items():
    target[k] = np.array(v["HP"])
    feature[k] = v.drop(["HP", "Corr"], axis = 1)

feature_list = list(feature[1].columns)

for k in feature:
    feature[k] = np.array(feature[k])
for k in feature:
    feature_train[k], feature_test[k], target_train[k], target_test[k] = train_test_split(
            feature[k], target[k], test_size = 0.25, random_state = 42)

What I've tried, following the help from Random Forest Feature Importance Chart using Python:

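# rf here is assumed to be a RandomForestRegressor already fitted in the modelling step (not shown)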
for name, importance in zip(feature_list, list(rf.feature_importances_)):
    print(name, "=", importance)

but this prints the importances for only one of the dictionary keys (and I don't know which one). What I want is to have them printed for all the keys in the dictionary "importances". Thanks in advance!

PratikSharma
  • Why do you "get training and test data separately based on one column, let's say Home"? Why don't you use [sklearn's train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) directly? It's hard to tell what's going on in your code tbh. – Szymon Maszke Jan 17 '19 at 11:07

1 Answer


If I understand you correctly, you want the feature importances for both the train and the test data.

That's not how it works: the random forest is first built from your training data, and only after that operation can the importance of each feature be calculated, based on how many times the feature was used to split the space and how 'good' those splits were (e.g. how low the Gini impurity was), across many trees of course.

So you obtain feature importances for the training data; for the test data, the learned forest is only used to predict values.
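
To get them printed for every key, here is a minimal sketch. It assumes the dictionaries feature_train, target_train, feature_list and importances built in your question; rf_models and the n_estimators = 100 setting are just placeholders for whatever you use in your modelling step. It fits one RandomForestRegressor per key and stores that model's feature_importances_ under the same key:

    from sklearn.ensemble import RandomForestRegressor

    rf_models = {}
    for k in feature_train:
        # one forest per "Home" key, fitted on that key's training split
        rf_models[k] = RandomForestRegressor(n_estimators = 100, random_state = 42)
        rf_models[k].fit(feature_train[k], target_train[k])
        # map feature names to this key's importances
        importances[k] = dict(zip(feature_list, rf_models[k].feature_importances_))

    # print the importances for every key
    for k, imp in importances.items():
        print("Key:", k)
        for name, value in imp.items():
            print(" ", name, "=", value)

As said above, these importances come from the forests fitted on the training splits; the test splits only come into play when you predict.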

Szymon Maszke