
I am trying to find the best hyperparameters for XGBClassifier, which would lead to identifying the most predictive attributes. I am using RandomizedSearchCV to iterate and validate through KFold.

As I run this process a total of 5 times (numFolds = 5), I want the best results to be saved in a DataFrame called collector (specified below). So on each iteration, I would want the best results and score appended to the collector DataFrame.

import numpy as np
import pandas as pd
import xgboost as xgb
from scipy import stats
from scipy.stats import randint
from sklearn import cross_validation
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score

clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.6),
              'subsample': stats.uniform(0.3, 0.9),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.9),
              'min_child_weight': [1, 2, 3, 4]
             }
clf = RandomizedSearchCV(clf_xgb, param_distributions = param_dist, n_iter = 25, scoring = 'roc_auc', error_score = 0, verbose = 3, n_jobs = -1)

numFolds = 5
folds = cross_validation.KFold(n = len(X), shuffle = True, n_folds = numFolds)

collector = pd.DataFrame()
estimators = []
results = np.zeros(len(X))
score = 0.0

for train_index, test_index in folds:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    estimators.append(clf.best_estimator_)
    estcoll = pd.DataFrame(estimators)


    estcoll['score'] = score
    pd.concat([collector,estcoll])
    print "\n", len(collector), "\n"
score /= numFolds

For some reason nothing is being saved to the DataFrame; please help.

Also, I have about 350 attributes to cycle through, with 3.5K rows in train and 2K in testing. Would running this through a Bayesian hyperparameter optimization process potentially improve my results, or would it only save processing time?

zad0xlik
1 Answer


RandomizedSearchCV() will do more for you than you realize. Explore the cv_results_ attribute of your fitted CV object at the documentation page.

Here's your code, pretty much unchanged. The two changes I made:

  1. I changed n_iter=25 to n_iter=5. This will try 5 sets of parameters, which with your 5-fold cross-validation means 25 total fits.
  2. I defined your KFold object before RandomizedSearchCV, and then referenced it in the construction of RandomizedSearchCV as the cv parameter.


clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.59),
              'subsample': stats.uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.4),
              'min_child_weight': [1, 2, 3, 4]
             }

numFolds = 5
kfold_5 = cross_validation.KFold(n = len(X), shuffle = True, n_folds = numFolds)

clf = RandomizedSearchCV(clf_xgb, 
                         param_distributions = param_dist,
                         cv = kfold_5,  
                         n_iter = 5, # you want 5 here not 25 if I understand you correctly 
                         scoring = 'roc_auc', 
                         error_score = 0, 
                         verbose = 3, 
                         n_jobs = -1)

Here's where my answer deviates from your code significantly. Just fit the RandomizedSearchCV object once; there is no need to loop. It handles the CV looping with its cv argument.

clf.fit(X_train, y_train)

All your cross-validated results are now in clf.cv_results_. For example, you can get the cross-validated (mean across 5 folds) train score with clf.cv_results_['mean_train_score'], or the cross-validated test-set (held-out data) score with clf.cv_results_['mean_test_score']. You can also get other useful things like mean_fit_time and params, and clf, once fitted, will automatically remember your best_estimator_ as an attribute.

These are what's relevant for determining the best set of hyperparameters for model fitting. A single set of hyperparameters is constant across the 5 folds used within a single iteration of n_iter, so you don't have to peer into the different scores between folds within an iteration.
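If you still want something like your collector DataFrame, here is a minimal sketch (assuming clf has already been fitted as above; it only uses standard cv_results_ keys and attributes of the fitted search object):

# One row per sampled parameter set, with its cross-validated test score and fit time
collector = pd.DataFrame(clf.cv_results_['params'])
collector['mean_test_score'] = clf.cv_results_['mean_test_score']
collector['mean_fit_time'] = clf.cv_results_['mean_fit_time']
print(collector.sort_values('mean_test_score', ascending=False).head())

# The winning parameter set and its cross-validated score are stored directly on clf
print(clf.best_params_)
print(clf.best_score_)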

freespace
Max Power
  • Thanks and this helps! Do you know why this error occurs, and do I need to suppress/fix it? /model_selection/_validation.py:252: FitFailedWarning: Classifier fit failed. The score on this train-test partition for these parameters will be set to 0.000000. Details: XGBoostError('value 1.8782 for Parameter colsample_bytree exceed bound [0,1]',) "Details: \n%r" % (error_score, e), FitFailedWarning) – zad0xlik May 12 '17 at 05:18
  • the error message you posted says `XGBoostError('value 1.8782 for Parameter colsample_bytree exceed bound [0,1]`. The reason you're getting that is because `'colsample_bytree': stats.uniform(0.5, 0.9)` will yield a value sometimes outside the required [0,1] range for that parameter. – Max Power May 12 '17 at 05:38
  • @MaxPower do you know how to overcome this issue? – LetsPlayYahtzee Jul 22 '19 at 13:45
  • Hi @LetsPlayYahtzee, the solution to the issue in the comment above was to provide a distribution for each hyperparameter that will only ever produce valid values for that hyperparameter. For example, if you use [python's random.uniform(a,b)](https://docs.python.org/3/library/random.html#random.uniform), you can specify the min/max range (a,b) and be guaranteed to only get values in that range – Max Power Jul 22 '19 at 16:00
  • @MaxPower through digging a bit in the scipy documentation I figured out the proper answer. If you want the `colsample_bytree` to sample from the [0.5, 0.9] uniform distribution you need to specify `stats.uniform(0.5, 0.4)` instead of `stats.uniform(0.5, 0.9)`; a bit unintuitive, I know :) – LetsPlayYahtzee Jul 22 '19 at 18:34
  • Hi @LetsPlayYahtzee, you would still occasionally get an out-of-bounds value (outside [0,1]) with `stats.uniform(0.5, 0.4)`. I'd suggest using `random.uniform` as mentioned above instead, e.g. `random.uniform(0.01, 0.99)`, to ensure all values from the distribution are always valid. – Max Power Jul 23 '19 at 01:27
  • @MaxPower when specifying (0.5, 0.4) the range is [0.5, 0.9]; from the docs, the first arg is the loc and the second the scale - the final range is [loc, loc + scale]. I am not sure you are expected to get out-of-bounds results; even on 5M samples I can't find one, even though I get samples very close to 0.9 (0.899999779051796). Having to sample the distribution beforehand also implies that you need to store all the samples in memory, which you may not want to do in many cases. – LetsPlayYahtzee Jul 24 '19 at 16:57
  • thanks for the correction, I had thought `scipy.stats.uniform(a,b)` took a mean and stdev for some reason. sounds like your `stats.uniform` works fine as well. – Max Power Jul 24 '19 at 17:24
  • So you don't need to pass the evaluation in xgb.XGBClassifier()? – Vikrant Jun 21 '20 at 06:20
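For reference, here is a quick sketch (not from the original post) of the scipy.stats.uniform parameterization discussed in the comments above: the distribution covers [loc, loc + scale], so stats.uniform(0.5, 0.4) always stays inside the [0, 1] bound required for colsample_bytree.

from scipy import stats

# scipy.stats.uniform(loc, scale) samples uniformly from [loc, loc + scale],
# so uniform(0.5, 0.4) draws values in [0.5, 0.9].
samples = stats.uniform(0.5, 0.4).rvs(size=100000, random_state=0)
print(samples.min(), samples.max())  # both fall within [0.5, 0.9]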