
I have seen the answer at How is scikit-learn GridSearchCV best_score_ calculated? explaining what this score means.

I am working with the scikit-learn decision tree example and trying various values for the scoring parameter.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

if __name__ == '__main__':
    df = pd.read_csv('/Users/tcssig/Downloads/ad-dataset/ad.data', header=None)
    explanatory_variable_columns = set(df.columns.values)
    # The last column describes the targets
    response_variable_column = df[len(df.columns.values) - 1]
    explanatory_variable_columns.remove(len(df.columns.values) - 1)
    y = [1 if e == 'ad.' else 0 for e in response_variable_column]
    X = df[list(explanatory_variable_columns)]
    # Missing values are encoded as ' ?'; replace them with -1
    X = X.replace(to_replace=r' *\?', value=-1, regex=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    pipeline = Pipeline([('clf', DecisionTreeClassifier(criterion='entropy'))])
    # Note: min_samples_split=1 raises an error in recent scikit-learn versions (it must be >= 2)
    parameters = {'clf__max_depth': (150, 155, 160), 'clf__min_samples_split': (1, 2, 3), 'clf__min_samples_leaf': (1, 2, 3)}
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print(classification_report(y_test, predictions))

Every time I run this I get a different value for best_score_, ranging from 0.92 to 0.96.

Should this score determine which scoring parameter value I should finally use? Also, on the scikit-learn website, I see that accuracy should not be used for imbalanced classification.
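Since best_score_ is reported in whatever metric you pass to scoring, values obtained with different scorers are not directly comparable. A minimal sketch of how the choice of scoring changes what best_score_ reports, reusing the names from the code above (the specific metrics looped over here are just illustrations, not a recommendation for this dataset):

# best_score_ is the mean cross-validated score of the best candidate,
# measured with the metric named by 'scoring', so a number obtained with
# scoring='accuracy' cannot be compared directly to one from scoring='f1'.
for metric in ('accuracy', 'f1', 'roc_auc'):
    gs = GridSearchCV(pipeline, parameters, scoring=metric, n_jobs=-1)
    gs.fit(X_train, y_train)
    print('%s: best CV score = %0.3f' % (metric, gs.best_score_))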

Sarang Manjrekar

1 Answer


The best_score_ value is different every time because you have not passed a fixed value for random_state in your DecisionTreeClassifier. You can do the following in order to get the same value every time you run your code on any machine.

random_seed = 77  # it can be any value of your choice
pipeline = Pipeline([('clf', DecisionTreeClassifier(criterion='entropy', random_state=random_seed))])

I hope this will be useful.
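Note that train_test_split also shuffles the data randomly when no random_state is given, so for an end-to-end repeatable run you would typically fix that seed as well. A minimal sketch, assuming the same pipeline and data as in the question (the seed value 77 is arbitrary):

random_seed = 77  # any fixed value works
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_seed)
pipeline = Pipeline([('clf', DecisionTreeClassifier(criterion='entropy',
                                                    random_state=random_seed))])
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)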

enterML
  • Actually, I get a different best_score_ value when passing different scoring methods, such as 'accuracy', 'f1', etc. So I want to know how we decide the value to be passed in the scoring parameter. Shall we look at the best_score_ to decide this? – Sarang Manjrekar Sep 13 '16 at 17:27
  • From your question and your comment on this answer, I suggest you read the Sk-learn documentation for GridSearchCV and then re-read the stackoverflow post you linked to originally. – Nick Becker Sep 13 '16 at 18:26
  • Totally agree with Nick Becker. – enterML Sep 13 '16 at 19:07
  • I did refer to it again, but the point is I cannot figure out whether GridSearchCV best_score_ is directly related to the quality of the grid search cross-validation fit. – Sarang Manjrekar Sep 14 '16 at 07:23