
I want to create a multi-output classifier. However, my problem is that the distribution of positive labels varies greatly between outputs, e.g. output 1 has 2% positive labels while output 2 has 20% positive labels. So I want to split data sampling and model fitting for each output into multiple streams (multiple sub-pipelines), where each sub-pipeline performs its own oversampling, and the hyperparameters for both the oversampler and the classifier are optimized separately for each output too.
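
To make the imbalance concrete, here is a quick way to check the per-output positive rate (a small sketch using the toy label matrix from the example below; the 2%/20% figures are from my real data):

import numpy as np

# Each column of y is one output, so the column means are the positive rates.
y = np.array([[0, 1], [0, 1], [0, 0], [1, 0], [0, 0]])
print(y.mean(axis=0))  # [0.2 0.4] for this toy matrix; ~0.02 vs ~0.20 in my real data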

For example, suppose that I have

import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X = # some input features array here
y = np.array([[0,1],
              [0,1],
              [0,0],
              [1,0],
              [0,0]]) # imbalanced label distribution

y_1 = y[:, 0]
y_2 = y[:, 1]


param_grid_shared = {'oversampler__sampling_strategy': [0.2, 0.4, 0.5], 'logit__C': [1, 0.1, 0.01]}

pipeline_output_1 = Pipeline([('oversampler', SMOTE()), ('logit', LogisticRegression())])
grid_1 = GridSearchCV(pipeline_output_1, param_grid_shared)
grid_1.fit(X, y_1)

pipeline_output_2 = Pipeline([('oversampler', SMOTE()), ('logit', LogisticRegression())])
grid_2 = GridSearchCV(pipeline_output_2, param_grid_shared)
grid_2.fit(X, y_2)

And I want to combine them to create something like

multi_pipe = Pipeline([(Something to separate X and y into multiple streams),
                       ((pipe_1, pipeline_output_1),
                        (pipe_2, pipeline_output_2)),  # 2 pipelines optimized separately
                       (Evaluate and select hyperparameters for each pipeline separately),
                       (Something to combine output from pipeline 1 and pipeline 2)
                      ])

in Neuraxle or Sklearn

MultiOutputClassifier definitely won't work for this case, since it would fit every output with the same pipeline and the same hyperparameters, and I am not quite sure where to look for a solution now.
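
To show roughly the behaviour I am after, here is a quick scikit-learn-only sketch I put together for illustration (PerOutputGridSearch is just a name I made up, not an existing sklearn/imblearn class): it tunes one grid search per output column and stacks the per-column predictions.

from sklearn.base import BaseEstimator, ClassifierMixin, clone

class PerOutputGridSearch(BaseEstimator, ClassifierMixin):
    """Illustrative wrapper: one independently tuned GridSearchCV per output column."""

    def __init__(self, pipeline, param_grid):
        self.pipeline = pipeline
        self.param_grid = param_grid

    def fit(self, X, Y):
        # Tune oversampler + classifier hyperparameters separately for each output.
        self.grids_ = []
        for j in range(Y.shape[1]):
            grid = GridSearchCV(clone(self.pipeline), self.param_grid)
            grid.fit(X, Y[:, j])
            self.grids_.append(grid)
        return self

    def predict(self, X):
        # Combine the per-output predictions into an (n_samples, n_outputs) array.
        return np.column_stack([grid.predict(X) for grid in self.grids_])

multi_clf = PerOutputGridSearch(
    Pipeline([('oversampler', SMOTE()), ('logit', LogisticRegression())]),
    param_grid_shared)
multi_clf.fit(X, y)

What I would really like, though, is this behaviour expressed as first-class pipeline steps (ideally in Neuraxle), rather than a hand-rolled wrapper.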

  • Good question! Issue created here: https://github.com/Neuraxio/Neuraxle/issues/473 A workaround would be to create different data samplers, different AutoML objects and metrics, and train the two pipelines with two different AutoML loops but using the same pipeline. The data sampler could sample differently for the multiple outputs using data stores (repositories) in the context's services. A better solution is yet to be found. – Guillaume Chevalier Apr 13 '21 at 16:26
  • I also added the following issue which is another idea described in the answer below: https://github.com/Neuraxio/Neuraxle/issues/474 – Guillaume Chevalier Apr 13 '21 at 16:31
  • @GuillaumeChevalier Thank you. For now I will just work around this, and hope for a new feature. – ton nuttapong Apr 14 '21 at 17:37

1 Answer


I created an issue with the following idea:

pipe_1_with_oversampler_1 = Pipeline([
    Oversampler1().assert_has_services(DataRepository), Pipeline1()])
pipe_2_with_oversampler_2 = Pipeline([
    Oversampler2().assert_has_services(DataRepository), Pipeline2()])

multi_pipe = Pipeline([
    DataPreprocessingStep(),
    # Evaluate and select hyperparameters for each pipeline separately, but within one run, using `multi_pipe.fit(...)`: 
    FeatureUnion([
        AutoML(pipe_1_with_oversampler_1, **automl_args_1),
        AutoML(pipe_2_with_oversampler_2, **automl_args_2)
    ]),
    # And then combine output from pipeline 1 and pipeline 2 using feature union. 
    # Can do preprocessing and postprocessing as well.
    PostprocessingStep(),
])

For this to work, the AutoML object could be refactored into a regular step, and therefore be usable in place of one.

– Guillaume Chevalier