
I am solving a binary classification problem over some text documents using Python and the scikit-learn library, and I wish to try different models to compare and contrast results - mainly a Naive Bayes classifier, and an SVM tuned with stratified K-fold CV (cv=5). I am having difficulty combining all of the methods into one pipeline, given that the latter model uses GridSearchCV(). I cannot have multiple Pipelines running during a single implementation due to concurrency issues, hence I need to implement all the different models using one pipeline.

This is what I have so far:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# pipeline for naive bayes
# split_into_lemmas is a custom analyzer function defined elsewhere in my code
naive_bayes_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# accessing and using the pipelines
naive_bayes = naive_bayes_pipeline.fit(train_data['data'], train_data['gender'])

# pipeline for SVM
svm_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', SVC())
])

param_svm = [
  {'classifier__C': [1, 10], 'classifier__kernel': ['linear']},
  {'classifier__C': [1, 10], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]

grid_svm_skf = GridSearchCV(
    svm_pipeline,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 uses "all cores"
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5),  # stratified K-fold CV with 5 folds
)

svm_skf = grid_svm_skf.fit(train_data['data'], train_data['gender'])
predictions_svm_skf = svm_skf.predict(test_data['data'])

EDIT 1: The second pipeline is the only pipeline using GridSearchCV(), and it never seems to be executed.

EDIT 2: Added more code to show the GridSearchCV() usage.

denbuttigieg
  • What do you mean by concurrency issues? Are you running out of memory? How about saving each pipeline (after it is fit) to a file? Then load the one you want and train your model. Also, please share any error messages you are seeing. – pault Jan 29 '18 at 18:32
  • Can you elaborate more about "I cannot have multiple Pipelines running during a single implementation due to concurrency issues", I suspect this is the [X-Y problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). At least, it is not obvious to me what concurrency issues would be solved by a `Pipeline`. – juanpa.arrivillaga Jan 29 '18 at 18:35
  • @pault I can't seem to start the execution of the second pipeline, given that I already have a running pipeline. – denbuttigieg Jan 29 '18 at 18:35
  • What I am trying to achieve is to evaluate my training data against different models. To do this, I am using pipelines to extract features from the data and then classify it. However, the execution of the second pipeline never seems to start when running the program. – denbuttigieg Jan 29 '18 at 18:37
  • Well, pipelines still run serially as far as I am aware. Perhaps your first pipeline is simply taking a long time? Grid search can take a long time. – juanpa.arrivillaga Jan 29 '18 at 18:40
  • I am not using GridSearchCV() in my first pipeline, but the first pipeline is executing just fine, and the results are being achieved as required. – denbuttigieg Jan 29 '18 at 18:41
  • So, then the second pipeline is the one using grid search... why do you say it never appears to be executed? I think you should expand on this as an [edit](https://stackoverflow.com/posts/48507651/edit) to your question, before this becomes a long chain of comments. – juanpa.arrivillaga Jan 29 '18 at 18:43
  • I think it would make sense to add a piece of code where you're calling `GridSearchCV()`... – MaxU - stand with Ukraine Jan 29 '18 at 18:49
  • @MaxU, done. (edit2) – denbuttigieg Jan 29 '18 at 18:53
  • @denbuttigieg, try to pass `GridSearchCV(..., verbose=3)` and check what it outputs... – MaxU - stand with Ukraine Jan 29 '18 at 18:59
  • Also, with the verbose param, try n_jobs=1 first; if that works, then increase it. – Vivek Kumar Jan 30 '18 at 02:26
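
Picking up the suggestions from the comments above, a minimal sketch of a more debuggable version of the question's grid search (assuming the svm_pipeline, param_svm, and train_data objects defined earlier; verbose and n_jobs are standard GridSearchCV parameters):

grid_svm_skf = GridSearchCV(
    svm_pipeline,
    param_grid=param_svm,
    refit=True,
    n_jobs=1,            # run single-threaded while diagnosing; raise once it works
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5),
    verbose=3,           # print progress for every candidate/fold so you can see the search start
)
grid_svm_skf.fit(train_data['data'], train_data['gender'])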

1 Answer


Consider checking out similar questions here:

  1. Compare multiple algorithms with sklearn pipeline
  2. Pipeline: Multiple classifiers?

To summarize, here is an easy way to optimize over any classifier and, for each classifier, over any setting of its parameters.

Create a switcher class that works for any estimator

from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier


class ClfSwitcher(BaseEstimator):

    def __init__(self, estimator=SGDClassifier()):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - the classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)

Now you can pass anything in for the estimator parameter, and you can optimize any parameter of any estimator you pass in, as follows:

Perform hyper-parameter optimization

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

parameters = [
    {
        'clf__estimator': [SGDClassifier()], # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        'clf__estimator__loss': ['hinge', 'log', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)

How to interpret clf__estimator__loss

clf__estimator__loss is interpreted as the loss parameter of whatever estimator is set to, where estimator = SGDClassifier() in the topmost example; estimator is itself a parameter of clf, which is a ClfSwitcher object.
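
As an illustration of how to inspect the result of the search above, a minimal sketch (best_params_, best_estimator_ and predict are standard GridSearchCV attributes/methods; the test_data name is assumed):

# after gscv.fit(...), the winning estimator and its settings are available
print(gscv.best_params_)  # dict of the best parameter combination, including 'clf__estimator'

best_clf = gscv.best_estimator_.named_steps['clf'].estimator  # the underlying sklearn classifier
predictions = gscv.predict(test_data)  # refit is True by default, so this uses the best pipeline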

cgnorthcutt
  • I am familiar with `GridSearchCV` in the traditional case with one estimator. Can you explain what is actually happening in the `GridSearchCV` when you provide parameters with two estimators? Does it perform 5-fold CV twice (i.e., one round for the `SGDClassifier` and one round for `MultinomialNB`) and then repeat it for each set of grid parameters? – slaw Feb 04 '19 at 13:52
  • Do you know if it is possible to provide multiple datasets as a parameter so that I can fit different estimators with different datasets? – slaw Feb 05 '19 at 14:42
  • Sure: `for dataset in datasets: gscv.fit(...)` – cgnorthcutt Feb 06 '19 at 15:02
  • I don't think that would work as the multiple calls to `gscv.fit` would clobber the fit from the last dataset. I want each of the calls to `fit` with different datasets to be appended. – slaw Feb 06 '19 at 15:47
  • Clobber? Just initialize each time. `gscv = GridSearchCV(); gscv.fit()` There isn't much more to this. – cgnorthcutt Feb 08 '19 at 19:22
  • @cgnorthcutt how does one extract the scores for, say, each estimator (SGDClassifier() or MultinomialNB()), given that it's not using *named_steps*? – GSA May 10 '22 at 22:46
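
One way to look at per-estimator scores from a search like the one above is through the cv_results_ attribute of the fitted GridSearchCV; a minimal sketch, assuming the gscv object from the answer and using pandas only for convenience:

import pandas as pd

# every candidate parameter combination and its CV score is recorded in cv_results_
results = pd.DataFrame(gscv.cv_results_)

# the estimator tried in each row is stored under 'param_clf__estimator',
# so the mean test scores can be grouped by classifier type
results['estimator_name'] = results['param_clf__estimator'].apply(lambda est: type(est).__name__)
print(results.groupby('estimator_name')['mean_test_score'].max())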