
I have some text data and I would like to experiment with multiple data transformation techniques, each paired with a distinct set of models, in a single GridSearchCV call. The idea: given data transformation A in a pipeline with models 1, 2, and 3, and data transformation B in a pipeline with models 4, 5, and 6, which combination (A with 1, 2, or 3, or B with 4, 5, or 6) produces the best prediction results?

Currently I have been making two separate GridSearchCV calls, one per pipeline, but this seems pretty inefficient even when wrapped in multiprocessing. I have also looked around the internet for examples of something similar to what I want to do.
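For concreteness, the two-call approach described above looks roughly like this (a minimal sketch; the data, models, and parameter grids are placeholders, not my actual setup):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Placeholder text data standing in for the real dataset
texts = ["good movie", "bad film", "great plot", "terrible acting",
         "wonderful cast", "awful script", "fine story", "poor ending"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Search 1: transformation A (TF-IDF) with its own model family
pipe_a = Pipeline([("vect", TfidfVectorizer()), ("clf", LogisticRegression())])
grid_a = GridSearchCV(pipe_a, {"clf__C": [0.1, 1.0]}, cv=2)
grid_a.fit(texts, labels)

# Search 2: transformation B (raw counts) with a different model family
pipe_b = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
grid_b = GridSearchCV(pipe_b, {"clf__alpha": [0.5, 1.0]}, cv=2)
grid_b.fit(texts, labels)

# The winner then has to be picked manually across the two searches
best = max([grid_a, grid_b], key=lambda g: g.best_score_)
```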

I found this little tutorial: https://www.kaggle.com/evanmiller/pipelines-gridsearch-awesome-ml-pipelines but it only does half of what I want. Consider the example below (taken from the "Pipeline 4.0 - contVars + taxes (FeatureUnion intro)" section of the linked tutorial):

pipeline = Pipeline([

    ('unity', FeatureUnion(
        transformer_list=[

            ('cont_portal', Pipeline([
                ('selector', PortalToColDimension(contVars)),
                ('cont_imp', Imputer(missing_values='NaN', strategy = 'median', axis=0)),
                ('scaler', StandardScaler())             
            ])),
            ('tax_portal', Pipeline([
                ('selector', PortalToColDimension(taxVars)),
                ('tax_imp', Imputer(missing_values='NaN', strategy = 'most_frequent', axis=0)),
                ('scaler', MinMaxScaler(copy=True, feature_range=(0, 3)))
            ])),
        ],
    )),
    ('column_purge', SelectKBest(k = 5)),    
    ('lgbm', LGBMRegressor()),
])

parameters = {}
parameters['column_purge__k'] = [5, 10]

grid = GridSearchCV(pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 2)
grid.fit(x_train, y_train)   

print('Best score and parameter combination = ')

print(grid.best_score_)    
print(grid.best_params_)    

y_pred = grid.predict(x_valid)

It appears that while 'cont_portal' and 'tax_portal' produce two distinct data transformation pipelines (the first half of what I want to do), the FeatureUnion concatenates both outputs and feeds them to the same LGBMRegressor. Is it possible to instead have, say, 'cont_portal' used ONLY by an LGBMRegressor and 'tax_portal' used ONLY by a logit model, while still maintaining the single, general pipeline and a single call to GridSearchCV?

mgrogger

1 Answer


Answering my own question, because I realize now that this was mostly due to my naivete with GridSearchCV's parameter-grid format. Hopefully this helps someone with the same question!

I created a ClfSwitcher() class according to this post for simplicity. In order to do what I specified in my question and search two separate transformation/model branches, it's as easy as this:
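Since the linked post isn't reproduced here, this is roughly what such a ClfSwitcher looks like (my sketch of the pattern; the default estimator choice is arbitrary):

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier

class ClfSwitcher(BaseEstimator):
    """A placeholder final step whose `estimator` parameter
    GridSearchCV can swap out like any other hyperparameter."""

    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
```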


from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('kernel', Nystroem(kernel='linear')),
    ('clf', ClfSwitcher()),
])

# A list of dicts: each dict is searched as its own sub-grid, so each
# branch only pairs its transformations with its own models.
parameters = [
    {
        # Branch 1: either vectorizer, no kernel approximation, LogisticRegression
        'vect': [TfidfVectorizer(), CountVectorizer()],
        'kernel': ['passthrough'],  # skip the Nystroem step in this branch
        'clf__estimator': [LogisticRegression()],
        'clf__estimator__C': [100, 10, 5, 3, 1],
        'clf__estimator__random_state': [1],
        'clf__estimator__solver': ['liblinear'],
        'clf__estimator__class_weight': ['balanced']
    },
    {
        # Branch 2: TF-IDF (the pipeline default), kernel approximation, SGDClassifier
        'kernel': [Nystroem(kernel='poly'), Nystroem(kernel='rbf')],
        'clf__estimator': [SGDClassifier()],
        # SGDClassifier regularizes via alpha, not C
        'clf__estimator__alpha': [0.0001, 0.001, 0.01, 0.1, 1],
        'clf__estimator__random_state': [1],
        'clf__estimator__class_weight': ['balanced']
    }
]

gscv = GridSearchCV(pipeline, parameters, cv=5, refit=True)
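To see the combined search run end to end, here is a self-contained toy version (placeholder data; ClfSwitcher is reproduced inline and the Nystroem step dropped so the snippet stands on its own):

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class ClfSwitcher(BaseEstimator):
    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def score(self, X, y):
        return self.estimator.score(X, y)

# Placeholder text data standing in for the real dataset
texts = ["good movie", "bad film", "great plot", "terrible acting",
         "wonderful cast", "awful script", "fine story", "poor ending"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", ClfSwitcher()),
])

# Each dict is an independent sub-grid: vectorizers and models are
# only combined within their own branch.
parameters = [
    {"vect": [TfidfVectorizer()],
     "clf__estimator": [LogisticRegression()],
     "clf__estimator__C": [1.0, 10.0]},
    {"vect": [CountVectorizer()],
     "clf__estimator": [SGDClassifier(random_state=1)],
     "clf__estimator__alpha": [1e-4, 1e-3]},
]

gscv = GridSearchCV(pipeline, parameters, cv=2, refit=True)
gscv.fit(texts, labels)
print(gscv.best_params_)  # the winning branch and its hyperparameters
```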
mgrogger
  • what if you want to test multiple parameters of 'vect' or 'kernel'? Is it enough to prepend "vect__" or "kernel__" to the parameters? Where should you put them? – Antonio Sesto Nov 20 '21 at 06:43