
I got a little confused when using models from sklearn: how do I set the specific optimization function? For example, when RandomForestClassifier is used, how do I let the model 'know' that I want to maximize 'recall', 'F1 score', or 'AUC' instead of 'accuracy'?

Any suggestions? Thank you.

user6396
  • There are different classifiers for a reason; each of them is trained to optimize a different objective function. In RandomForest, for example, each node is *greedily trained* to find the split that maximizes the information gain (by the Gini criterion, or the entropy of the labelling) of the children. So, 1) RandomForest does not maximize accuracy directly, and 2) *recall* and *F1 score* are not metrics that you train a model with; they are metrics to evaluate already-trained models. You could always design variants of classifiers to maximize some of those scores, but not all of them allow it. – Imanol Luengo Aug 30 '17 at 12:32
  • @ImanolLuengo, you cleared things up a lot for me. Could you point me to an example of how to "design variants of classifiers to maximize some of those scores"? Thank you. You are right about random forest using Gini or entropy. What about other models, such as logistic regression (which uses maximum likelihood, I suppose), SVM, or LDA? Is there a way to specify different optimization functions? – user6396 Aug 31 '17 at 01:54
  • Not directly, and not in an easy way: you would have to mathematically reformulate the classifier's optimization function to introduce a penalty for your score (not always possible) and then code it. The easiest way to achieve it, as @MohammedKashif stated in his answer, is to train several models with different parameters and keep the one that achieves the maximum score on your metric. – Imanol Luengo Aug 31 '17 at 08:22

2 Answers


What you are looking for is parameter tuning. Basically, you first select an estimator, then you define a hyper-parameter space (i.e. all the parameters and their respective values that you want to tune), a cross-validation scheme, and a scoring function. Depending on how you want to search the parameter space, you can choose one of the following:

Exhaustive Grid Search: In this approach, sklearn's GridSearchCV builds a grid of all possible combinations of the hyper-parameter values defined by the user. For instance:

from sklearn.tree import DecisionTreeClassifier

# The 'classifier__' prefix assumes the estimator is wrapped in a Pipeline
# step named 'classifier' (see the sketch below); drop the prefix when
# tuning a bare estimator.
my_clf = DecisionTreeClassifier(random_state=0, class_weight='balanced')
param_grid = dict(
    classifier__min_samples_split=[5, 7, 9, 11],
    classifier__max_leaf_nodes=[50, 60, 70, 80],
    classifier__max_depth=[1, 3, 5, 7, 9],
)

In this case, the grid specified is a cross-product of values of classifier__min_samples_split, classifier__max_leaf_nodes and classifier__max_depth. The documentation states that:

The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.
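
To make the snippet above concrete, here is a minimal sketch of how that grid could be used; the Pipeline step name 'classifier' (which the 'classifier__' prefixes refer to), the recall scoring, and the toy data are illustrative assumptions, not part of the original code:

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # toy binary data

# Pipeline step named 'classifier' -- this is what 'classifier__' refers to
pipe = Pipeline([
    ('classifier', DecisionTreeClassifier(random_state=0, class_weight='balanced'))
])

param_grid = dict(
    classifier__min_samples_split=[5, 7, 9, 11],
    classifier__max_leaf_nodes=[50, 60, 70, 80],
    classifier__max_depth=[1, 3, 5, 7, 9],
)

# scoring='recall' tells the search to pick the parameters that maximize recall
grid = GridSearchCV(pipe, param_grid=param_grid, scoring='recall', cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)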

An example of using GridSearchCV:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import make_scorer, f1_score

# Create a classifier
clf = LogisticRegression(random_state=0)

# Cross-validation scheme (features/labels are your training data,
# e.g. a pandas DataFrame and Series)
n_splits = 5
cv = StratifiedKFold(n_splits=n_splits).split(features, labels)

# Declare the hyper-parameter grid (parameter names match LogisticRegression
# directly, since clf is not wrapped inside a Pipeline here)
param_grid = dict(
    tol=[1.0, 0.1, 0.01, 0.001],
    C=np.power(10.0, np.arange(-3, 2)).tolist(),
    solver=['newton-cg', 'lbfgs', 'liblinear', 'sag'],
)

# Perform grid search using the classifier, parameter grid,
# scoring function and the cross-validation scheme
grid_search = GridSearchCV(clf, param_grid=param_grid, verbose=10,
                           scoring=make_scorer(f1_score), cv=list(cv))

grid_search.fit(features.values, labels.values)

# Best score under the specified scoring function
print(grid_search.best_score_)

# Best estimator found by the search
best_clf = grid_search.best_estimator_
print(best_clf)

You can read its documentation here to learn about the various attributes and methods for retrieving the best parameters, the best score, and so on.

Randomized Search: Instead of exhaustively searching the hyper-parameter space, sklearn provides RandomizedSearchCV to do a randomized search over the parameters. The documentation states that:

RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values.

You can read more about it from here.
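
As a rough sketch (the distributions, the n_iter value, and the use of scipy.stats.loguniform, which needs a reasonably recent scipy, are assumptions for illustration):

from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, f1_score

# C is sampled from a continuous log-uniform distribution instead of a fixed grid
param_distributions = dict(
    C=loguniform(1e-3, 1e2),
    solver=['newton-cg', 'lbfgs', 'liblinear', 'sag'],
)

random_search = RandomizedSearchCV(
    LogisticRegression(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,                         # number of sampled parameter settings
    scoring=make_scorer(f1_score),
    cv=5,
)
# random_search.fit(features.values, labels.values)   # same data as above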

You can read more about other approaches here.


Edit: In your case, if you want to maximize recall for the model, you simply specify recall_score from sklearn.metrics (wrapped in make_scorer) as the scoring function.
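
For example (a sketch, reusing the grid-search setup from above):

from sklearn.metrics import make_scorer, recall_score

grid_search = GridSearchCV(clf, param_grid=param_grid,
                           scoring=make_scorer(recall_score), cv=5)
# equivalently: scoring='recall'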

If you wish to optimize for 'False Positives', as mentioned in your question, you can refer to this answer to extract the 'False Positives' from the confusion matrix. Then wrap that in the make_scorer function and pass it to the GridSearchCV object for tuning.
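
A rough sketch of such a custom scorer (the helper function name is mine; whether you set greater_is_better=True or False depends on whether you want the search to maximize or minimize that count):

from sklearn.metrics import confusion_matrix, make_scorer

def false_positive_count(y_true, y_pred):
    # For binary labels, confusion_matrix(...).ravel() returns tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp

fp_scorer = make_scorer(false_positive_count, greater_is_better=False)
# grid_search = GridSearchCV(clf, param_grid=param_grid, scoring=fp_scorer, cv=5)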

Gambit1614
  • Thank you. I am aware of parameter tuning to get the best results, but I am asking more about the optimization function of the models themselves; is there any way to change that? – user6396 Aug 31 '17 at 01:54
  • @user6396 According to your question, you want to optimize your model according to the scoring function you specify? If that is the case, then that is exactly what happens in the parameter-tuning modules of sklearn described above. Is there something else that I am missing? – Gambit1614 Aug 31 '17 at 05:27

I would suggest you grab a cup of coffee and read (and understand) the following

http://scikit-learn.org/stable/modules/model_evaluation.html

You need to use something along the lines of

from sklearn.model_selection import cross_val_score

cross_val_score(model, X, y, scoring='f1')

Possible choices are (check the docs):

['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 
'average_precision', 'completeness_score', 'explained_variance', 
'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 
'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score', 
'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 
'neg_mean_squared_log_error', 'neg_median_absolute_error', 
'normalized_mutual_info_score', 'precision', 'precision_macro', 
'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 
'recall', 'recall_macro', 'recall_micro', 'recall_samples', 
'recall_weighted', 'roc_auc', 'v_measure_score']
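
For instance, a minimal sketch tying this back to the question's RandomForestClassifier (the toy data is an assumption):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # toy binary data
model = RandomForestClassifier(random_state=0)

# Evaluate the same model under different metrics
print(cross_val_score(model, X, y, scoring='f1').mean())
print(cross_val_score(model, X, y, scoring='recall').mean())
print(cross_val_score(model, X, y, scoring='roc_auc').mean())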

Have fun, Umberto

Umberto
  • I don't think this actually answers the question. This relates only to the *evaluation* of the model, not the **optimization** of the model. – Andnp Aug 30 '17 at 16:44