I just came across this example on Model Grid Selection here:

https://chrisalbon.com/machine_learning/model_selection/model_selection_using_grid_search/

Question:

The example reads

# Create a pipeline
pipe = Pipeline([('classifier', RandomForestClassifier())])

# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [LogisticRegression()],
                 'classifier__penalty': ['l1', 'l2'],
                 'classifier__C': np.logspace(0, 4, 10)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators': [10, 100, 1000],
                 'classifier__max_features': [1, 2, 3]}]

As I understand the code, search_space contains the candidate classifiers and their parameters. However, I don't understand what the purpose of Pipeline is, or why it contains RandomForestClassifier().

Background: In my desired workflow, I need to train a doc2vec model (gensim) combined with 3 different classifiers. The parameters of both the doc2vec model and the classifiers should be tuned with a grid search. I would like to store the results in a table and save the best model, that is, the one with the highest accuracy.

Vivek Kumar
Christopher
  • What do you mean 'why is it here'? – Bram Vanroy May 10 '18 at 11:58
  • As I understand the code, `search_space` contains the used classifiers and their parameters. However, I don't get what the purpose of `Pipeline` is and why `RandomForestClassifier()` is in here? Edited the text. – Christopher May 10 '18 at 12:00
  • I had the exact same question. It seems to me that the pipeline creation step is almost like an initialization of a pipeline, and then in the search_space array, the `classifier` key each time overwrites the `RandomForestClassifier()` of the `pipe = ...` line. I have been searching for an answer for this over the past few days, and I even messaged Chris Albon, but no luck yet. I am not sure if I am right. – nvergos Oct 18 '18 at 20:19

1 Answer

A Pipeline chains a sequence of data transformation steps, with the classifier or regressor as the final step. For example, you might first convert text to numbers using TfidfVectorizer and then train the classifier on the result:

pipe = Pipeline([('vectorizer',TfidfVectorizer()), 
                 ('classifier', RandomForestClassifier())])

With only a single estimator and no transformation steps, a Pipeline is not strictly needed.

In your code, it is used as a placeholder so that the parameters can be addressed with the 'classifier' prefix. The classifier itself is then substituted by the value of the 'classifier' key in each dictionary of the search space.
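As a minimal sketch of this placeholder behaviour, the snippet below runs the grid search end to end. The synthetic data from `make_classification`, the reduced parameter grids, and `cv=3` are assumptions chosen for a quick run; they are not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification data (assumption, for illustration only)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# The estimator in the 'classifier' step is only a placeholder;
# every dictionary in search_space replaces it.
pipe = Pipeline([('classifier', RandomForestClassifier())])

search_space = [
    {'classifier': [LogisticRegression(max_iter=1000)],
     'classifier__C': np.logspace(0, 4, 3)},
    {'classifier': [RandomForestClassifier(random_state=0)],
     'classifier__n_estimators': [10, 50],
     'classifier__max_features': [1, 2]},
]

search = GridSearchCV(pipe, search_space, cv=3)
search.fit(X, y)

# One row per candidate: which estimator and parameters, and its mean CV score
for params, score in zip(search.cv_results_['params'],
                         search.cv_results_['mean_test_score']):
    print(type(params['classifier']).__name__, round(score, 3))

# The winning pipeline, refitted on all of the data
best_step = search.best_estimator_.named_steps['classifier']
print('best:', type(best_step).__name__)
```

`search.cv_results_` is a plain dictionary, so it can be turned into a table of results, and `search.best_estimator_` is the refitted pipeline with the highest mean score. The original `pipe` is left unchanged because `GridSearchCV` works on clones of it.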

Vivek Kumar
  • Does this mean that with each dictionary in `search_space` the `classifier` (`estimator` as first argument of `GridSearchCV`) gets substituted by the value of the `classifier` key in the dictionary? e.g., `LogisticRegression()` in the first case. – nvergos Oct 18 '18 at 20:22
  • @nvergos Yes, correct. The pipeline handles the order in which the parameters are assigned: first the `classifier` step is set, and then `classifier__penalty` or `classifier__n_estimators` is applied to it. – Vivek Kumar Oct 22 '18 at 11:35
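
The assignment order described in the last comment can be checked directly with `set_params`, without running a grid search. This is a small sketch that assumes the behaviour of `Pipeline.set_params` in current scikit-learn releases; the value `C=10.0` is arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([('classifier', RandomForestClassifier())])

# The nested parameter is deliberately listed first. The pipeline still
# replaces the 'classifier' step before it applies 'classifier__C'.
pipe.set_params(classifier__C=10.0, classifier=LogisticRegression())

clf = pipe.named_steps['classifier']
print(type(clf).__name__, clf.C)
```

If the step were not replaced first, setting `classifier__C` would fail, because `RandomForestClassifier` has no `C` parameter.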