
I have N features per day in my dataframe, going back 20 days (a time series): roughly 400 features x 100k rows.

I'm trying to identify the most important features, so I've trained my XGBoost model this way:

import xgboost as xgb

model = xgb.XGBRegressor(learning_rate=0.01, n_estimators=1000, max_depth=20)

# evaluate on a held-out set and stop early when RMSE stops improving
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric="rmse", eval_set=eval_set, verbose=True, early_stopping_rounds=20)

And then:

import pandas as pd

def plot_fimportance(xgbmodel, df_x, top_n=30):
    # assumes the booster reports features as f0, f1, ... and maps them back to column names
    features = df_x.columns.values
    mapFeat = dict(zip(["f" + str(i) for i in range(len(features))], features))
    ts = pd.Series(xgbmodel.get_booster().get_fscore())
    ts.index = ts.reset_index()['index'].map(mapFeat)
    ts.sort_values()[-top_n:].plot(kind="barh", figsize=(8, top_n - 10), title="feature importance")

plot_fimportance(model, df.drop(columns=['label']))

I've heard that the parameter max_depth should be calculated thus:

max_depth = number of features / 3

I think this may work with small datasets, but if I train my model with max_depth=133, my PC might explode, and I would probably overfit as well.

How could I calculate the optimal value of max_depth with this huge number of features?


1 Answer


That equation doesn't give you the optimal depth; it's merely a heuristic. If you want the optimal depth, then you have to find it empirically: find a functional starting point and vary it in each direction. Apply gradient descent to approach the best answer.

If all you wanted was the maximum limit that would run on your machine, you could tediously compute the storage requirements and find the largest value. To balance this with overfitting ... you need to choose your trade-offs, and you're still stuck with the experimentation.
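A minimal sketch of that empirical search, reusing the X_train/X_test/y_train/y_test splits and the fit-time early stopping from the question (newer xgboost releases move eval_metric and early_stopping_rounds into the constructor). The candidate depths are arbitrary starting points, not recommendations: score each on the held-out set, then move in whichever direction improves validation RMSE.

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Coarse set of depths to probe first; purely illustrative values.
candidate_depths = [4, 6, 8, 12, 16]
results = {}

for depth in candidate_depths:
    model = xgb.XGBRegressor(learning_rate=0.01, n_estimators=1000, max_depth=depth)
    model.fit(X_train, y_train,
              eval_metric="rmse",
              eval_set=[(X_test, y_test)],
              early_stopping_rounds=20,
              verbose=False)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    results[depth] = rmse
    print("max_depth=%d -> validation RMSE %.4f" % (depth, rmse))

best_depth = min(results, key=results.get)
print("best depth so far:", best_depth)
# Next pass: probe depths around best_depth with a smaller step and repeat
# until the validation RMSE stops improving.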

  • Could you explain or give an example of _Apply gradient descent_, please? Thanks! :) – mllamazares Mar 22 '17 at 00:11
  • That moves into the realm of "tutorial", which is beyond the range of Stack Overflow's purpose. In this case, think of it as the Newton-Raphson method for finding the solution of an equation. Very briefly, you run it with a couple of choices for depth. See which one works best for you. Adjust the depth and run again. Repeat this process, adjusting appropriately to find the optimal point, until you get close enough that you can declare that you're done. – Prune Mar 22 '17 at 00:17
  • Ok, now I see your point, I could do that with [GridSearchCV()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) (sketched below this thread). But what if, based on the rmse, it determines that 82 is the _optimal_ value... how do I know that there is no overfitting? And don't forget that all this is only to find the top 10 features (I wouldn't work with all 400 columns in _real life_). – mllamazares Mar 22 '17 at 00:38
  • 1
    Well, how do you *normally* detect overfitting? There's no difference here. – Prune Mar 22 '17 at 00:48
  • Sorry, I thought overfitting was directly related to `max_depth`, but it's `max_features` instead: http://stackoverflow.com/a/22546016/1709738. Then can we conclude that the higher the max_depth, the better? Thank you! – mllamazares Mar 22 '17 at 01:06
  • 1
    In general, yes. Going deeper *can* promote overfitting; actually, anything that improves the training process can cause overfitting. The root cause is how faithfully your training data represents the set of all available inputs. If you have gaps in the coverage, then hard training will adapt to those gaps, and the resulting model will not work well on input that comes from those gaps. – Prune Mar 22 '17 at 01:11
  • 1
    In short, worry about max_depth first; leave overfitting for later. – Prune Mar 22 '17 at 01:11
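For reference, a minimal sketch of the GridSearchCV() approach mentioned in the comments, reusing X_train/y_train from the question. The grid values, cv=3, and the reduced n_estimators=200 (chosen only to keep the cross-validated search affordable, since fit-time early stopping doesn't combine cleanly with GridSearchCV) are arbitrary choices, not tuned recommendations.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Small, illustrative grid; refine it around whichever depth wins.
param_grid = {"max_depth": [4, 8, 12, 16]}

search = GridSearchCV(
    estimator=xgb.XGBRegressor(learning_rate=0.01, n_estimators=200),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=1,
)
search.fit(X_train, y_train)

print("best max_depth:", search.best_params_["max_depth"])
print("cross-validated RMSE:", np.sqrt(-search.best_score_))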