I was trying to find the optimal parameters for a decision tree classifier on the Iris dataset using sklearn.grid_search.GridSearchCV
. I used StratifiedKFold (sklearn.cross_validation.StratifiedKFold
) for cross-validation, since my data was imbalanced. But every execution of GridSearchCV
returned a different set of parameters.
Shouldn't it return the same set of optimal parameters, given that the data and the cross-validation splits were the same every single time?
Source code follows:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

# Load the Iris data (defined earlier in the notebook; load_iris makes the snippet self-contained)
iris = load_iris()
all_inputs = iris.data
all_classes = iris.target

decision_tree_classifier = DecisionTreeClassifier()

parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}

cross_validation = StratifiedKFold(all_classes, n_folds=10)

grid_search = GridSearchCV(decision_tree_classifier, param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)
print("Best Score: {}".format(grid_search.best_score_))
print("Best params: {}".format(grid_search.best_params_))
Outputs from four successive runs:
Best Score: 0.959731543624
Best params: {'max_features': 2, 'max_depth': 2}
Best Score: 0.973154362416
Best params: {'max_features': 3, 'max_depth': 5}
Best Score: 0.973154362416
Best params: {'max_features': 2, 'max_depth': 5}
Best Score: 0.959731543624
Best params: {'max_features': 3, 'max_depth': 3}
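For a bit more context on why the winner flips: the old GridSearchCV also exposes grid_scores_, so (assuming the grid_search object from the snippet above has already been fit) you can print the mean cross-validation score for every parameter combination. The top combinations tend to score very close to one another, so small changes in the fitted trees are enough to swap which one comes out best.

# Inspect all parameter combinations, not just the winner
for params, mean_score, cv_scores in grid_search.grid_scores_:
    print("{}: mean = {:.4f} (std = {:.4f})".format(params, mean_score, cv_scores.std()))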
This is an excerpt from an IPython notebook I made recently, with reference to Randal S. Olson's notebook, which can be found here.
Edit:
It's not the random_state
parameter of StratifiedKFold
that causes the varying results, but rather the random_state
parameter of DecisionTreeClassifier
, which controls the random selection of features considered at each split and so can give a different tree on every fit (refer documentation). As for StratifiedKFold
, as long as the shuffle
parameter is left at False
(the default), it generates the same training-test splits every time (refer documentation).
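As a quick check of the above, here is a minimal sketch (it loads the Iris data via sklearn.datasets.load_iris rather than the CSV used in the original notebook, but otherwise mirrors the code above): with random_state fixed on DecisionTreeClassifier and shuffle left at its default of False, repeated grid searches report identical best_params_.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

# load_iris is used here only to make the sketch self-contained
iris = load_iris()
all_inputs, all_classes = iris.data, iris.target

parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}

for run in range(2):
    # Fixing random_state pins the random feature selection at each split
    decision_tree_classifier = DecisionTreeClassifier(random_state=0)
    # shuffle defaults to False, so the folds are identical on every run
    cross_validation = StratifiedKFold(all_classes, n_folds=10)
    grid_search = GridSearchCV(decision_tree_classifier,
                               param_grid=parameter_grid,
                               cv=cross_validation)
    grid_search.fit(all_inputs, all_classes)
    print("Run {}: best params = {}".format(run + 1, grid_search.best_params_))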