22

I would like to use the xgboost cv function to find the best parameters for my training data set. I am confused by the API. How do I find the best parameters? Is this similar to the sklearn grid_search cross-validation function? How can I find which of the options for the max_depth parameter ([2,4,6]) was determined to be optimal?

from sklearn.datasets import load_iris
import xgboost as xgb
iris = load_iris()
DTrain = xgb.DMatrix(iris.data, iris.target)
x_parameters = {"max_depth":[2,4,6]}
xgb.cv(x_parameters, DTrain)
...
Out[6]: 
   test-rmse-mean  test-rmse-std  train-rmse-mean  train-rmse-std
0        0.888435       0.059403         0.888052        0.022942
1        0.854170       0.053118         0.851958        0.017982
2        0.837200       0.046986         0.833532        0.015613
3        0.829001       0.041960         0.824270        0.014501
4        0.825132       0.038176         0.819654        0.013975
5        0.823357       0.035454         0.817363        0.013722
6        0.822580       0.033540         0.816229        0.013598
7        0.822265       0.032209         0.815667        0.013538
8        0.822158       0.031287         0.815390        0.013508
9        0.822140       0.030647         0.815252        0.013494
kilojoules

4 Answers

15

You can use GridSearchCV with xgboost through the xgboost sklearn API

Define your classifier as follows:

from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV  # was sklearn.grid_search in old sklearn versions

# other_params holds your fixed settings (e.g. n_estimators, learning_rate)
xgb_model = XGBClassifier(**other_params)

# grid of values to search over
test_params = {
    'max_depth': [4, 8, 12]
}

model = GridSearchCV(estimator=xgb_model, param_grid=test_params)
model.fit(train, target)  # train: feature matrix, target: labels
print(model.best_params_)
Rohit
  • For me GridSearchCV is here: from sklearn.model_selection import GridSearchCV – Dimitar Nentchev Sep 27 '20 at 04:01
  • 1
    CMIIW, grid search is not the same as cross-validation; many people get confused by expecting a model to be returned from cross-validation (you could still do this using cross_validate), but the basic purpose of cross-validation is to give an estimate of model performance on seen/unseen data. – Anggi Permana Harianja Sep 12 '22 at 06:36
13

Cross-validation is used for estimating the performance of one set of parameters on unseen data.

Grid-search evaluates a model with varying parameters to find the best possible combination of these.

The sklearn docs talk a lot about CV, and the two can be used in combination, but they each have very different purposes.

You might be able to fit xgboost into sklearn's grid-search functionality. Check out the sklearn interface to xgboost for the smoothest integration.
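
As a rough sketch of that distinction, using the iris data from the question and the XGBClassifier wrapper: cross-validation scores one fixed parameter set, while grid search repeats that scoring for every candidate.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
from xgboost import XGBClassifier

iris = load_iris()

# cross-validation: estimate the performance of ONE fixed parameter set
scores = cross_val_score(XGBClassifier(max_depth=4), iris.data, iris.target, cv=5)
print(scores.mean())

# grid search: repeat that CV for every candidate and keep the best combination
search = GridSearchCV(XGBClassifier(), {'max_depth': [2, 4, 6]}, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_)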

Aske Doerge
  • I have a question: "parameters" here means 2 things: (1) hyperparameters, e.g. the regularization lambda in Lasso, which is also an input of the model; (2) weight parameters, e.g. the linear coefficients in Lasso, which are generated by the model itself. So CV is for estimating the performance of the hyperparameters on unseen data? Grid search is used to find the best possible combination of these (hyperparameters?). Best in what sense? Best CV scores? If that is true, then why can't I just use grid search to pick the best hyperparameters? – KevinKim Mar 26 '17 at 03:04
  • 1
    Grid search evaluates a model with many sets of hyperparameters based on a metric you define. The best-performing set of hyperparameters is returned. There is the possibility that the reported set of hyperparameters overfits your data. To remedy this, you can do CV for each set of hyperparameters instead of just calculating the metric once. This has a better chance of avoiding overfitting. – Aske Doerge Mar 26 '17 at 14:18
  • 1
    In `GridSearch`, there is an option `cv`, which I always use, so I think I implemented what you said. Then I believe the model with the optimal set of hyperparameters (obtained by GridSearch with cv) should outperform the other models in the same class with the other sets of hyperparameters I searched, when evaluated on a completely new, independent test data set that comes from the same distribution as the training data set. Is that correct? – KevinKim Mar 27 '17 at 01:59
9

Sklearn's GridSearchCV is the way to go if you are looking for parameter tuning. You just need to pass the xgboost classifier to GridSearchCV and evaluate on the best CV score.

Here is a nice tutorial which might help you get started with parameter tuning: http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
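
A minimal sketch of that, again with the iris data from the question (the grid values are just for illustration); best_params_ holds the winning combination and best_score_ its mean cross-validated score.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

iris = load_iris()
param_grid = {'max_depth': [2, 4, 6], 'n_estimators': [50, 100]}

search = GridSearchCV(XGBClassifier(), param_grid, cv=3, scoring='accuracy')
search.fit(iris.data, iris.target)

print(search.best_params_)  # winning parameter combination
print(search.best_score_)   # its mean cross-validated accuracy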

Deepish
7

I would go with hyperOpt

https://github.com/hyperopt/hyperopt

It is open source and has worked great for me. If you do choose it and need help, I can elaborate.

When you want to search over "max_depth": [2,4,6], you can naively solve this by running 3 models, each with one of the max depths you want, and see which model yields the best results (see the sketch below).
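
A rough sketch of that naive loop, reusing the xgb.cv call from the question (the default metric is RMSE, so lower is better; the exact column name may vary across xgboost versions):

import xgboost as xgb
from sklearn.datasets import load_iris

iris = load_iris()
dtrain = xgb.DMatrix(iris.data, iris.target)

results = {}
for depth in [2, 4, 6]:
    params = {'max_depth': depth}  # one fixed value per run, not a list
    cv = xgb.cv(params, dtrain, num_boost_round=10, nfold=3)
    results[depth] = cv['test-rmse-mean'].min()  # best round for this depth

best_depth = min(results, key=results.get)  # depth with the lowest CV error
print(best_depth, results)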

But "max_depth" is not the only hyper parameter you should consider tune. There are a lot of other hyper parameters, such as: eta (learning rate), gamma, min_child_weight, subsample and so on. Some are continues and some are discrete. (assuming you know your objective functions and evaluation metrics)

You can read about all of them here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

When you look at all those "parameters" and the size of the search space they create, it is huge. You cannot search it by hand (nor can an "expert" give you the best values for them).

Therefore, hyperopt gives you a neat solution to this and builds you a search space that is neither purely random nor a grid. All you need to do is define the parameters and their ranges.

You can find a code example here: https://github.com/bamine/Kaggle-stuff/blob/master/otto/hyperopt_xgboost.py
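
For concreteness, here is a minimal sketch of that workflow, assuming the iris data from the question and an illustrative search space (the ranges and settings below are made up for the example, not taken from the linked code):

import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import load_iris

iris = load_iris()
dtrain = xgb.DMatrix(iris.data, iris.target)

# mix of discrete and continuous hyperparameters
space = {
    'max_depth': hp.choice('max_depth', [2, 4, 6]),
    'eta': hp.uniform('eta', 0.01, 0.3),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
}

def objective(params):
    # score one candidate with cross-validation; fmin minimizes the returned value
    cv = xgb.cv(params, dtrain, num_boost_round=50, nfold=3)
    return cv['test-rmse-mean'].min()

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25, trials=Trials())
print(best)  # note: for hp.choice, fmin reports the index into the list, not the value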

I can tell you from my own experience that it worked better than Bayesian optimization on my models. Give it a few hours/days of trial and error, and contact me if you encounter issues you cannot solve.

Good luck!

Eran Moshe