I want to create a simple pipeline with Neuraxle (I know I can use other libraries, but I want to use Neuraxle) that cleans the data, splits it, trains 2 models, and compares them.

I want my pipeline to do something like this:

p = Pipeline([
    PreprocessData(),
    SplitData(),
    # (some magic to start the training of both models with the split of the previous step)
    ("model1", model1(params)),
    ("model2", model2(params)),
    (evaluate),
])

I don't know if it's even possible because I couldn't find anything in the documentation.

I also tried using models other than those from sklearn (e.g. catboost, xgboost ...), and I get this error:

AttributeError: 'CatBoostRegressor' object has no attribute 'setup'

I thought about creating a class for the models but I won't use the hyperparam search of neuraxle

Guillaume Chevalier
kAch

1 Answer

Yes! You can do something like this:

p = Pipeline([
    PreprocessData(),
    ColumnTransformer([
        (0, model1(params)),  # Model 1 will receive Column 0 of data
        ([1, 2], model2(params)),  # Model 2 will receive Column 1 and 2 of data
    ], n_dimension=2, n_jobs=2),
    (evaluate)
])

The flow of data will be split into two.

The n_jobs=2 argument should create two threads. It may also be possible to pass a custom class that puts the data back together using the joiner argument. We'll be releasing some changes soon so that this works properly; for now, the pipeline runs with 1 thread.
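To make the column split above more concrete, here is a minimal, library-free sketch of the idea: each branch receives its own column(s), and the branch outputs are concatenated back together row by row. The helper names (select_columns, split_apply_join, double, row_sum) are hypothetical and are not part of Neuraxle; the real ColumnTransformer may differ in details such as how it joins the outputs.

```python
# Library-free illustration of "split by columns, apply, join".
# All names here are hypothetical helpers, not Neuraxle's API.
from typing import Callable, List, Sequence, Tuple, Union

Row = List[float]
Columns = Union[int, Sequence[int]]
Step = Callable[[List[Row]], List[Row]]


def select_columns(rows: List[Row], columns: Columns) -> List[Row]:
    """Keep only the requested column(s) of every row."""
    indices = [columns] if isinstance(columns, int) else list(columns)
    return [[row[i] for i in indices] for row in rows]


def split_apply_join(rows: List[Row], branches: List[Tuple[Columns, Step]]) -> List[Row]:
    """Send each column subset to its own step, then concatenate the outputs per row."""
    outputs = [step(select_columns(rows, columns)) for columns, step in branches]
    return [sum(parts, []) for parts in zip(*outputs)]


def double(rows: List[Row]) -> List[Row]:
    """Stand-in for model 1: doubles every value it receives."""
    return [[2 * value for value in row] for row in rows]


def row_sum(rows: List[Row]) -> List[Row]:
    """Stand-in for model 2: sums the columns it receives."""
    return [[sum(row)] for row in rows]


data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
# Branch 1 gets column 0, branch 2 gets columns 1 and 2.
joined = split_apply_join(data, [(0, double), ([1, 2], row_sum)])
print(joined)
```

Each row of the result holds the output of the first branch followed by the output of the second, which is the role the joiner plays in the pipeline above.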

As for your CatBoostRegressor model, which behaves like a scikit-learn estimator but doesn't come from scikit-learn: can you try SKLearnWrapper(model1(params)) instead of simply model1(params) when declaring the model in the pipeline? Neuraxle probably didn't recognize the model as a scikit-learn model (that is, a BaseEstimator object in scikit-learn), even though your object has the same API as scikit-learn's BaseEstimator. So you may need to wrap your model in SKLearnWrapper manually, or write your own similar wrapper to adapt your class to Neuraxle.
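If you go the "own wrapper" route, the general shape is a small adapter around the fit/predict model. The sketch below is library-free so it runs anywhere: MeanRegressor is a hypothetical stand-in for a third-party model such as CatBoostRegressor, and EstimatorAdapter is a hypothetical class, not Neuraxle's. In a real pipeline you would most likely subclass Neuraxle's BaseStep instead, and the exact method signatures (for example, whether setup receives a context) depend on your Neuraxle version.

```python
# Hypothetical adapter sketch; a real Neuraxle step would subclass BaseStep.
from statistics import mean


class MeanRegressor:
    """Stand-in for a third-party model that only exposes fit/predict."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class EstimatorAdapter:
    """Wraps a fit/predict model and adds the methods a pipeline step is expected to have."""

    def __init__(self, model):
        self.model = model
        self.is_initialized = False

    def setup(self, context=None):
        # The hook whose absence caused the AttributeError; nothing to prepare here.
        self.is_initialized = True
        return self

    def fit(self, data_inputs, expected_outputs=None):
        self.model.fit(data_inputs, expected_outputs)
        return self

    def transform(self, data_inputs):
        # For a predictor, "transforming" the data means predicting on it.
        return self.model.predict(data_inputs)


step = EstimatorAdapter(MeanRegressor()).setup()
step.fit([[0.0], [1.0], [2.0]], [1.0, 2.0, 3.0])
predictions = step.transform([[5.0], [6.0]])
print(predictions)
```

The point is only that the wrapped object, not the raw model, is what goes into the pipeline; SKLearnWrapper does this job for you when the model follows the scikit-learn API.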

Related: https://stackoverflow.com/a/60302366/2476920


EDIT:

You can use the ParallelQueuedFeatureUnion class of Neuraxle. Example coming soon.

Also see this parallel pipeline usage example: https://www.neuraxle.org/stable/examples/parallel/plot_streaming_pipeline.html#sphx-glr-examples-parallel-plot-streaming-pipeline-py

Guillaume Chevalier