Try multiple estimators in one grid search

Question

Is there a way to grid-search multiple estimators at once in sklearn or any other library? For example, can we pass SVM and Random Forest in one grid search?

I was trying to create a grid search for multiple algorithms at once — tj89, Sep 14 '16 at 03:21

score 33 · Answer 1 · edited Feb 24 '21 at 05:55


Yes. Example:

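from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDClassifier()),  # placeholder; replaced by the 'clf' entries below
])
# One dict per estimator, each carrying only that estimator's parameters
parameters = [
    {
        'vect__max_df': (0.5, 0.75, 1.0),
        'clf': (SGDClassifier(),),
        'clf__alpha': (0.00001, 0.000001),
        'clf__penalty': ('l2', 'elasticnet'),
        'clf__max_iter': (10, 50, 80),  # named n_iter in older scikit-learn releases
    }, {
        'vect__max_df': (0.5, 0.75, 1.0),
        'clf': (LinearSVC(),),
        'clf__C': (0.01, 0.5, 1.0)
    }
]
grid_search = GridSearchCV(pipeline, parameters)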
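
As a rough end-to-end illustration of the same idea, the sketch below fits one grid search over two classifiers and then checks which estimator types were tried. The toy corpus, the labels, and the small parameter values are made up for the example and are not from the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = about cats, 0 = about cars.
docs = [
    "the cat sat on the mat", "cats purr and sleep", "a kitten chased the cat",
    "my cat likes fish", "the cat climbed a tree", "kittens and cats play",
    "the car needs new tires", "cars drive on the road", "a fast car engine",
    "my car is in the garage", "the car has a flat tire", "engines power cars",
]
labels = [1] * 6 + [0] * 6

# The 'clf' step is only a placeholder; each dict below swaps in its own estimator.
pipeline = Pipeline([("vect", CountVectorizer()), ("clf", SGDClassifier())])
param_grid = [
    {"clf": [SGDClassifier(random_state=0)], "clf__alpha": [1e-4, 1e-3]},
    {"clf": [LinearSVC()], "clf__C": [0.1, 1.0]},
]

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)

# Every candidate records which estimator object it used for the 'clf' step.
tried = {type(p["clf"]).__name__ for p in search.cv_results_["params"]}
print(tried)
print(type(search.best_params_["clf"]).__name__, search.best_score_)
```

After fitting, `search.best_estimator_` is the whole pipeline refit with the winning classifier, so it can be used directly for prediction.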

edited Feb 24 '21 at 05:55

flyingdutchman


answered Oct 20 '16 at 13:10

j-a


4

Hi j-a, thanks for the answer. What I was actually looking for is how to create a pipeline where we can use two models, like SGDClassifier and SVM, in parallel. In this case the results from CountVectorizer are passed to SGDClassifier. Anyway, I changed my approach a bit to solve the problem. – tj89 Nov 04 '16 at 14:18
1

@tj89 it will run in parallel, but I suppose you mean specifically that CountVectorizer should be run once and then its result reused for each classifier? How did you change your approach? – j-a Nov 07 '16 at 10:34
1

I found (sklearn==0.23.2) you can just put None for the 'clf' in the pipeline. No need for dummy SGDClassifier. – Ryan J McCall Nov 10 '21 at 23:02

score 22 · Answer 2 · edited Oct 25 '20 at 14:26


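    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.tree import DecisionTreeClassifier
    
    class DummyEstimator(BaseEstimator):
        """Placeholder that is always replaced by the 'clf' entries below."""
        def fit(self, X, y=None): return self
        def score(self, X, y=None): pass
        
    # Create a pipeline
    pipe = Pipeline([('clf', DummyEstimator())]) # Placeholder Estimator
    
    # Candidate learning algorithms and their hyperparameters
    # (the liblinear solver supports both the 'l1' and 'l2' penalties)
    search_space = [{'clf': [LogisticRegression(solver='liblinear')], # Actual Estimator
                     'clf__penalty': ['l1', 'l2'],
                     'clf__C': np.logspace(0, 4, 10)},
                    
                    {'clf': [DecisionTreeClassifier()],  # Actual Estimator
                     'clf__criterion': ['gini', 'entropy']}]
    
    
    # Create grid search 
    gs = GridSearchCV(pipe, search_space)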

edited Oct 25 '20 at 14:26

FamousSnake


answered Nov 14 '18 at 02:30

Brian Spiering


How would you proceed if using OneVsRestClassifier, where the estimators you are testing are called within OneVsRestClassifier? You seem to be able to pass the different estimators/param grids to the outer estimator, but I can't find a way to pass parameters to the inner estimator. Just wondering if there is any magic to accomplish it all together. Even if I do a separate grid search for each inner estimator, I still don't know how to pass parameters to the inner estimators for the grid search. – Julian C Oct 08 '19 at 08:29
I think you can just put None in place of DummyEstimator. – Ryan J McCall Nov 10 '21 at 23:03
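
On the OneVsRestClassifier question in the comments above, one possible approach is to chain the double-underscore names one level deeper: `clf__estimator` swaps the inner estimator and `clf__estimator__C` sets a parameter on it. The following is a minimal sketch under that assumption; the iris data, the step name `clf`, and the parameter values are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# The outer step is fixed; only its inner `estimator` parameter is searched over.
pipe = Pipeline([("clf", OneVsRestClassifier(LogisticRegression()))])

search_space = [
    {"clf__estimator": [LogisticRegression(max_iter=1000)],
     "clf__estimator__C": [0.1, 1.0, 10.0]},
    {"clf__estimator": [LinearSVC(max_iter=10000)],
     "clf__estimator__C": [0.1, 1.0]},
]

gs = GridSearchCV(pipe, search_space, cv=3)
gs.fit(X, y)
print(gs.best_params_)
```

The same naming pattern should apply to any wrapper that exposes its inner model as a constructor parameter.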

score 13 · Answer 3 · answered Mar 01 '18 at 15:16

I think what you were looking for is this:

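import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# X (raw text documents) and y (labels) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)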

names = [
         "Naive Bayes",
         "Linear SVM",
         "Logistic Regression",
         "Random Forest",
         "Multilayer Perceptron"
        ]

classifiers = [
    MultinomialNB(),
    LinearSVC(),
    LogisticRegression(),
    RandomForestClassifier(),
    MLPClassifier()
]

parameters = [
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__alpha': (1e-2, 1e-3)},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__C': (np.logspace(-5, 1, 5))},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__C': (np.logspace(-5, 1, 5))},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__max_depth': (1, 2)},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__alpha': (1e-2, 1e-3)}
             ]

for name, classifier, params in zip(names, classifiers, parameters):
    clf_pipe = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english')),
        ('clf', classifier),
    ])
    gs_clf = GridSearchCV(clf_pipe, param_grid=params, n_jobs=-1)
    clf = gs_clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("{} score: {}".format(name, score))

Why have you prefixed it with clf? Can you call it anything you want? – Maths12, May 23 '20 at 18:11
You can really call it anything you want, @Maths12, but being consistent in the choice of prefix allows you to do parameter tuning with `GridSearchCV` for each estimator. You can get the same effect by using the _name_ in the example above, though. – Jakob, May 26 '20 at 22:31
This creates multiple grid searches, but the question asked for one grid search. – Ryan J McCall, Nov 10 '21 at 22:34

score 3 · Answer 4 · answered Jan 10 '19 at 08:05

You can use TransformedTargetRegressor. This class is designed for transforming the target variable before fitting, and it takes a regressor and a transformer as parameters. If you give no transformer, the identity transformer (i.e. no transformation) is applied. Since regressor is a constructor parameter, we can vary it in a grid search.

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

Y = np.array([1,2,3,4,5,6,7,8,9,10])
X = np.array([0,1,3,5,3,5,7,9,8,9]).reshape((-1, 1))

To do the grid search, we specify param_grid as a list of dicts, one per estimator. This is because different estimators accept different sets of parameters (e.g. setting fit_intercept on MLPRegressor causes an error). Note that the parameter name "regressor" comes from TransformedTargetRegressor itself.

model = TransformedTargetRegressor()
params = [
    {
        "regressor": [LinearRegression()],
        "regressor__fit_intercept": [True, False]
    },
    {
        "regressor": [MLPRegressor()],
        "regressor__hidden_layer_sizes": [1, 5, 10]
    }
]
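
To see why one shared dict would not work, here is a small check (a sketch added for illustration, not part of the original answer): `set_params` rejects parameter names the estimator does not have, while the list-of-dicts form only ever pairs each regressor with its own parameters.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ParameterGrid
from sklearn.neural_network import MLPRegressor

# LinearRegression accepts fit_intercept ...
LinearRegression().set_params(fit_intercept=False)

# ... but MLPRegressor has no such parameter, so set_params raises ValueError.
try:
    MLPRegressor().set_params(fit_intercept=False)
    rejected = False
except ValueError as err:
    rejected = True
    print(err)

# A list of dicts is expanded dict by dict: 2 candidates + 3 candidates.
grid = [
    {"regressor": [LinearRegression()], "regressor__fit_intercept": [True, False]},
    {"regressor": [MLPRegressor()], "regressor__hidden_layer_sizes": [1, 5, 10]},
]
n_candidates = len(ParameterGrid(grid))
print(n_candidates)
```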

We can fit as usual.

g = GridSearchCV(model, params)
g.fit(X, Y)

g.best_estimator_, g.best_score_, g.best_params_

# results in something like
(TransformedTargetRegressor(check_inverse=True, func=None, inverse_func=None,
               regressor=LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
          normalize=False),
               transformer=None),
 -0.419213380219391,
 {'regressor': LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
           normalize=False), 'regressor__fit_intercept': False})

score 2 · Answer 5 · answered Dec 25 '18 at 23:02

What you can do is create a class that takes in any classifier and for each classifier any setting of parameters.

Create a switcher class that works for any estimator

from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier

class ClfSwitcher(BaseEstimator):

    def __init__(
        self, 
        estimator = SGDClassifier(),
    ):
        """
        A Custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """ 

        self.estimator = estimator


    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self


    def predict(self, X, y=None):
        return self.estimator.predict(X)


    def predict_proba(self, X):
        return self.estimator.predict_proba(X)


    def score(self, X, y):
        return self.estimator.score(X, y)

Now you can set up your tfidf however you like. Fitting it beforehand is optional: GridSearchCV clones the pipeline, so the vectorizer is refit on each training fold anyway.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit(data, labels)

Now create a pipeline with this tfidf

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf',tfidf), # Refit on each training fold during the search
    ('clf', ClfSwitcher()),
])

Perform hyper-parameter optimization

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV



parameters = [
    {
        'clf__estimator': [SGDClassifier()], # SVM if hinge loss / logreg if log loss
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        'clf__estimator__loss': ['hinge', 'log_loss', 'modified_huber'],  # 'log_loss' was called 'log' before scikit-learn 1.1
    },
    {
        'clf__estimator': [MultinomialNB()],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, verbose=3)
# param optimization
gscv.fit(train_data, train_labels)

How to interpret `clf__estimator__loss`

`clf__estimator__loss` is interpreted as the `loss` parameter of whatever `estimator` is. In the first parameter dict above, `estimator = SGDClassifier()`, and `estimator` is itself a parameter of `clf`, which is a `ClfSwitcher` object.
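
The name resolution can be checked by hand with `set_params` and `get_params`. The snippet below is a minimal sketch: it uses a cut-down stand-in for `ClfSwitcher` (only what is needed for the demonstration) and a pipeline with a single `clf` step, both of which are simplifications for illustration.

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Cut-down stand-in for the switcher class in the answer."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)


pipeline = Pipeline([("clf", ClfSwitcher(estimator=SGDClassifier()))])

# Each '__' descends one level: pipeline step -> its `estimator` -> that estimator's `loss`.
pipeline.set_params(clf__estimator__loss="modified_huber")
inner = pipeline.named_steps["clf"].estimator
print(inner.loss)

# get_params lists every name that a grid search is allowed to set.
nested = sorted(k for k in pipeline.get_params() if k.startswith("clf__estimator__"))
print(nested[:5])
```

This is the same mechanism GridSearchCV relies on when it applies each candidate from the parameter grid.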