I have N features per day in my dataframe, going back 20 days (time series): I have ~400 features x 100k rows.
I’m trying to identify the most important features, so I’ve trained my XGBoost model this way:
model = xgb.XGBRegressor(learning_rate=0.01, n_estimators=1000, max_depth=20)
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric="rmse", eval_set=eval_set, verbose=True, early_stopping_rounds=20)
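(Side note in case the version matters for the answer: I'm on an older xgboost; if I understand the newer releases correctly, eval_metric and early_stopping_rounds are now passed to the constructor instead of fit(), roughly like this, with the same X_train / y_train / eval_set as above:)

# rough equivalent on newer xgboost versions (constructor-level early stopping) -- untested on my setup
model = xgb.XGBRegressor(learning_rate=0.01, n_estimators=1000, max_depth=20,
                         eval_metric="rmse", early_stopping_rounds=20)
model.fit(X_train, y_train, eval_set=eval_set, verbose=True)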
And then:
import pandas as pd

def plot_fimportance(xgbmodel, df_x, top_n=30):
    # map xgboost's internal feature names ("f0", "f1", ...) back to the dataframe columns
    features = df_x.columns.values
    mapFeat = dict(zip(["f" + str(i) for i in range(len(features))], features))
    ts = pd.Series(xgbmodel.get_booster().get_fscore())
    ts.index = ts.reset_index()['index'].map(mapFeat)
    # plot the top_n features with the highest fscore
    ts.sort_values()[-top_n:].plot(kind="barh", figsize=(8, top_n - 10), title="feature importance")
plot_fimportance(model, df.drop(columns=['label']))
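As a sanity check I also tried xgboost's built-in plotting helper; if I understand its API correctly, something like this gives a comparable chart without the manual "f0"/"f1" name mapping (assuming X_train was a DataFrame, so the booster already knows the column names):

# built-in alternative to the function above (column names assumed to be stored in the booster)
import matplotlib.pyplot as plt
import xgboost as xgb

xgb.plot_importance(model, max_num_features=30, importance_type="weight")
plt.show()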
I've heard that the parameter max_depth should be calculated thus:
max_depth = number of features / 3
I think this may work with small datasets, but if I train my model with max_depth=133 (roughly 400 / 3),
my PC might explode, and I would probably overfit as well.
How could I calculate the optimal value of max_depth with this huge number of features?
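Would something along these lines be a sensible way to pick it, i.e. just treating max_depth as a hyperparameter and letting early stopping on the validation set decide? (A minimal sketch reusing my X_train / y_train / eval_set from above; the candidate depths are placeholders, not recommendations:)

# try a few candidate depths and keep the one with the lowest validation RMSE
for depth in [3, 5, 8, 12]:
    m = xgb.XGBRegressor(learning_rate=0.01, n_estimators=1000, max_depth=depth)
    m.fit(X_train, y_train, eval_metric="rmse", eval_set=eval_set,
          verbose=False, early_stopping_rounds=20)
    print(depth, m.best_score, m.best_iteration)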