I have some text data on which I would like to try multiple data transformation techniques, each paired with its own distinct set of models, in a single GridSearchCV call. The idea is: given data transformation A in a pipeline with models 1, 2, and 3, and data transformation B in a pipeline with models 4, 5, and 6, which combination (A with 1, 2, or 3, or B with 4, 5, or 6) produces the best prediction results?
Currently I make two separate GridSearchCV calls, one per pipeline, but this seems pretty inefficient, even when run in a multiprocessing wrapper. I have also looked around the internet for examples of something similar to what I want to do.
I found this little tutorial: https://www.kaggle.com/evanmiller/pipelines-gridsearch-awesome-ml-pipelines but it only does half of what I want to do. Consider the example below (taken from the "Pipeline 4.0 - contVars + taxes (FeatureUnion intro)" section of the linked tutorial):
pipeline = Pipeline([
    ('unity', FeatureUnion(
        transformer_list=[
            ('cont_portal', Pipeline([
                ('selector', PortalToColDimension(contVars)),
                ('cont_imp', Imputer(missing_values='NaN', strategy = 'median', axis=0)),
                ('scaler', StandardScaler())
            ])),
            ('tax_portal', Pipeline([
                ('selector', PortalToColDimension(taxVars)),
                ('tax_imp', Imputer(missing_values='NaN', strategy = 'most_frequent', axis=0)),
                ('scaler', MinMaxScaler(copy=True, feature_range=(0, 3)))
            ])),
        ],
    )),
    ('column_purge', SelectKBest(k = 5)),
    ('lgbm', LGBMRegressor()),
])
parameters = {}
parameters['column_purge__k'] = [5, 10]
grid = GridSearchCV(pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 2)
grid.fit(x_train, y_train)
print('Best score and parameter combination = ')
print(grid.best_score_)
print(grid.best_params_)
y_pred = grid.predict(x_valid)
It appears that while 'cont_portal' and 'tax_portal' define two distinct data transformation pipelines (the first half of what I want to do), FeatureUnion concatenates their outputs, so both end up feeding the same LGBMRegressor. Is it possible to instead have, say, 'cont_portal' be used ONLY by an LGBMRegressor and 'tax_portal' be used ONLY by a Logit model, for example, while still maintaining a single, general pipeline and a single call to GridSearchCV?
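To make the goal concrete, here is a minimal sketch of one way this kind of search might be expressed. It relies on two documented scikit-learn behaviours: GridSearchCV accepts a list of parameter grids, and a Pipeline step can itself be set as a parameter. Everything specific in it is an assumption made for illustration and not taken from the tutorial: the toy text data, the step names 'transform' and 'model', and the choice of TfidfVectorizer and CountVectorizer as transformations A and B with three classifiers each.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy text data (hypothetical): 6 "good" and 6 "bad" documents.
texts = [
    "great movie loved it", "wonderful acting and plot", "really good film",
    "loved the great story", "good fun and wonderful cast", "great and good",
    "terrible movie hated it", "awful acting and plot", "really bad film",
    "hated the awful story", "bad boring and terrible cast", "awful and bad",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# One general pipeline. The initial steps are only placeholders;
# the grid below swaps in the real transformer and model.
pipeline = Pipeline([
    ("transform", TfidfVectorizer()),
    ("model", LogisticRegression()),
])

# A list of grids: each dict is searched independently, so transformation A
# is only ever paired with models 1-3 and transformation B with models 4-6.
param_grid = [
    {   # transformation A with models 1, 2, 3
        "transform": [TfidfVectorizer()],
        "model": [LogisticRegression(max_iter=1000), LinearSVC(), RidgeClassifier()],
    },
    {   # transformation B with models 4, 5, 6
        "transform": [CountVectorizer()],
        "model": [
            MultinomialNB(),
            DecisionTreeClassifier(random_state=0),
            RandomForestClassifier(n_estimators=10, random_state=0),
        ],
    },
]

# A single GridSearchCV call covers all six (transformation, model) pairs.
grid = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
grid.fit(texts, labels)

print("Best score and parameter combination = ")
print(grid.best_score_)
print(grid.best_params_)
```

Each dict could also carry hyperparameters for its own steps (for example a 'model__C' entry in a grid whose models all accept C). Note that this swaps the whole transformation step per grid; it does not route the two branches of a single FeatureUnion to different models, so in the tutorial's setting 'cont_portal' and 'tax_portal' would each have to become a candidate for the 'transform' step.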