
I read the following example on Pipelines and GridSearchCV in Python: http://www.davidsbatista.net/blog/2017/04/01/document_classification/

Logistic Regression:

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'))),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    "clf__estimator__C": [0.01, 0.1, 1],
    "clf__estimator__class_weight": ['balanced', None],
}

SVM:

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    "clf__estimator__C": [0.01, 0.1, 1],
    "clf__estimator__class_weight": ['balanced', None],
}

Is there a way to combine Logistic Regression and SVM into one Pipeline? Say I have a TfidfVectorizer and would like to test it against multiple classifiers, and then output the best model/parameters for each.

  • Possible duplicate of [Alternate different models in Pipeline for GridSearchCV](https://stackoverflow.com/questions/50265993/alternate-different-models-in-pipeline-for-gridsearchcv). – Vivek Kumar May 11 '18 at 06:41
  • What you are doing [here in this question](https://stackoverflow.com/questions/50272416/gridsearch-on-model-and-classifiers) is correct. That's how I did it in my above answer. – Vivek Kumar May 11 '18 at 06:51

3 Answers


Here is an easy way to optimize over any classifier and, for each classifier, over any setting of its parameters.

Create a switcher class that works for any estimator

from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier


class ClfSwitcher(BaseEstimator):

    def __init__(self, estimator=SGDClassifier()):
        """
        A Custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)

Now you can pass in anything for the estimator parameter. And you can optimize any parameter for any estimator you pass in as follows:

Perform hyper-parameter optimization

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

parameters = [
    {
        'clf__estimator': [SGDClassifier()], # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        'clf__estimator__loss': ['hinge', 'log', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)
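
Once it is fitted, the grid search exposes the winning combination directly. Here is a small sketch of how you might inspect it (test_data below is an assumption, not part of the original example):

# overall best parameter combination across both sub-grids
print(gscv.best_params_)
print(gscv.best_score_)

# the refit pipeline containing the winning ClfSwitcher
best_model = gscv.best_estimator_
predictions = best_model.predict(test_data)  # test_data: held-out documents (assumed)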

How to interpret clf__estimator__loss

clf__estimator__loss is interpreted as the loss parameter of whatever object estimator currently holds (an SGDClassifier() in the first grid above). estimator is itself a parameter of clf, which is the ClfSwitcher step in the pipeline.
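
To make the name routing concrete, here is a small sketch of what GridSearchCV effectively does with such an entry, via scikit-learn's nested set_params mechanism (the values are just illustrative):

# 'clf' selects the pipeline step, 'estimator' the ClfSwitcher parameter,
# and 'loss' a parameter of the object currently held by 'estimator'
pipeline.set_params(
    clf__estimator=SGDClassifier(),
    clf__estimator__loss='hinge',
)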

cgnorthcutt
  • Hi @cgnorthcutt, I have been using your solution for a little while now and it's the best I've seen so far for trying multiple models easily. The only drawbacks are that it doesn't work very well afterwards with things like ConfusionMatrix.from_estimator (to which I pass my grid_search object and it doesn't seem to find the fitted best estimator) and with estimators like OneVsRestClassifier(SVC()) that already wrap another estimator. Have you found workarounds for your solution, or have you been using totally different approaches since then? Thanks a lot. Antoine – Antoine101 May 26 '23 at 13:52

Yes, you can do that by building a wrapper function. The idea is to pass it two dictionaries: the models and the parameters.

Then you iteratively call the models with all the parameters to test, using GridSearchCV for this.

Check this example; it adds extra functionality so that at the end you get a data frame summarizing the different models/parameters and their performance scores.

EDIT: It's too much code to paste here; you can check a full working example at:

http://www.davidsbatista.net/blog/2018/02/23/model_optimization/
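
For a rough idea of what the wrapper does, here is a minimal sketch (this is not the code from the post; the dictionary keys, models, and grids below are illustrative, and X_train/y_train stand for your own data):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# one dictionary of models, one dictionary of parameter grids, keyed the same way
models = {
    'logreg': OneVsRestClassifier(LogisticRegression(solver='sag')),
    'linear_svc': OneVsRestClassifier(LinearSVC()),
}
params = {
    'logreg': {'clf__estimator__C': [0.01, 0.1, 1]},
    'linear_svc': {'clf__estimator__C': [0.01, 0.1, 1]},
}

results = {}
for name, model in models.items():
    pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', model)])
    gs = GridSearchCV(pipe, params[name], cv=3)
    gs.fit(X_train, y_train)  # X_train / y_train: your data (assumed)
    results[name] = (gs.best_score_, gs.best_params_)

Collecting the best score and parameters per model name like this is what the post then turns into the summary data frame mentioned above.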

David Batista

This is how I did it without a wrapper function. You can evaluate any number of classifiers. Each one can have multiple parameters for hyperparameter optimization.

The one with the best score will be saved to disk using joblib.

import joblib

from operator import itemgetter
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

# pipeline parameters: one dict per classifier to evaluate
parameters = [
    {
        'clf': [MultinomialNB()],
        'tf-idf__stop_words': ['english', None],
        'clf__alpha': [0.001, 0.1, 1, 10, 100]
    },
    {
        'clf': [SVC()],
        'tf-idf__stop_words': ['english', None],
        'clf__C': [0.001, 0.1, 1, 10, 100, 10e5],
        'clf__kernel': ['linear', 'rbf'],
        'clf__class_weight': ['balanced'],
        'clf__probability': [True]
    },
    {
        'clf': [DecisionTreeClassifier()],
        'tf-idf__stop_words': ['english', None],
        'clf__criterion': ['gini', 'entropy'],
        'clf__splitter': ['best', 'random'],
        'clf__class_weight': ['balanced', None]
    }
]

# evaluating multiple classifiers based on pipeline parameters
result = []

for params in parameters:

    # classifier for this grid
    clf = params['clf'][0]

    # getting the remaining arguments by popping out the classifier
    params.pop('clf')

    # pipeline
    steps = [('tf-idf', TfidfVectorizer()), ('clf', clf)]

    # cross validation using Grid Search
    grid = GridSearchCV(Pipeline(steps), param_grid=params, cv=3)
    grid.fit(features, labels)  # features / labels: your training data

    # storing result
    result.append({
        'grid': grid,
        'classifier': grid.best_estimator_,
        'best score': grid.best_score_,
        'best params': grid.best_params_,
        'cv': grid.cv
    })

# sorting results by best score
result = sorted(result, key=itemgetter('best score'), reverse=True)

# saving the best classifier
grid = result[0]['grid']
joblib.dump(grid, 'classifier.pickle')
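
Loading the saved grid later mirrors the dump call above. A small sketch (new_documents is assumed here, not part of the original answer):

import joblib

grid = joblib.load('classifier.pickle')
print(grid.best_params_)
predictions = grid.predict(new_documents)  # new_documents: raw texts (assumed)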

Tarun Pathak