Working within a SageMaker Jupyter notebook, I have an XGBoost pipeline that transforms my data and also runs some feature selection:
from scipy.stats import randint, uniform
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from xgboost import XGBClassifier

# Scale, select the k best features by mutual information, then classify
steps_xgb = [('scaler', MinMaxScaler()),
             ('feature_reduction', SelectKBest(mutual_info_classif)),
             ('xgb', XGBClassifier(objective='binary:logistic',
                                   use_label_encoder=False,
                                   random_state=1))]
pipeline_xgb = Pipeline(steps_xgb)

# Search space covers the scaler choice, k, and the XGBoost hyperparameters
parameters_xgb = [
    {
        'scaler': [StandardScaler(), MinMaxScaler()],
        'feature_reduction__k': randint(5, 80),
        'xgb__min_child_weight': randint(1, 10),
        'xgb__gamma': uniform(0, 0.5),
        'xgb__subsample': uniform(0.4, 0.6),
        'xgb__colsample_bytree': uniform(0.3, 0.7),
        'xgb__max_depth': randint(2, 7),
        'xgb__n_estimators': randint(100, 200),
        'xgb__learning_rate': uniform(0.03, 0.3)
    }
]

cv_xgb_events = RandomizedSearchCV(pipeline_xgb, param_distributions=parameters_xgb,
                                   cv=5, scoring='roc_auc', verbose=10,
                                   n_iter=200, n_jobs=30)
After defining this, I fit the model by calling:
cv_xgb_events.fit(X_train_events, y_train_events)
This takes a very long time to run; even on a large instance the notebook ends up shutting down before it finishes. I looked into running hyperparameter tuning jobs through SageMaker, but I don't know how to incorporate the pipeline, and specifically the feature selection step. My rough idea so far is sketched below.
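For context, this is roughly what I imagine the entry-point script for a SageMaker training job would look like if the whole search ran inside the job instead of the notebook. The file name train.py, the CSV layout, and the 'target' column name are placeholders I made up, and I haven't verified any of this actually works:

# train.py -- rough sketch of a training-job entry point that fits the whole
# pipeline (file layout and column names below are placeholders)
import argparse
import os

import joblib
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from xgboost import XGBClassifier

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # SageMaker sets these environment variables inside the training container
    parser.add_argument('--train', default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model-dir', default=os.environ.get('SM_MODEL_DIR'))
    args = parser.parse_args()

    # Assumes one CSV in the train channel with the label in a 'target' column (placeholder)
    data = pd.read_csv(os.path.join(args.train, 'train.csv'))
    y = data['target']
    X = data.drop(columns=['target'])

    # Same pipeline and search space as in the notebook
    pipeline_xgb = Pipeline([
        ('scaler', MinMaxScaler()),
        ('feature_reduction', SelectKBest(mutual_info_classif)),
        ('xgb', XGBClassifier(objective='binary:logistic',
                              use_label_encoder=False,
                              random_state=1))])
    parameters_xgb = {
        'scaler': [StandardScaler(), MinMaxScaler()],
        'feature_reduction__k': randint(5, 80),
        'xgb__min_child_weight': randint(1, 10),
        'xgb__gamma': uniform(0, 0.5),
        'xgb__subsample': uniform(0.4, 0.6),
        'xgb__colsample_bytree': uniform(0.3, 0.7),
        'xgb__max_depth': randint(2, 7),
        'xgb__n_estimators': randint(100, 200),
        'xgb__learning_rate': uniform(0.03, 0.3)}

    search = RandomizedSearchCV(pipeline_xgb, param_distributions=parameters_xgb,
                                cv=5, scoring='roc_auc', n_iter=200, n_jobs=-1)
    search.fit(X, y)

    # Save the best fitted pipeline so it can be deployed or reused later
    joblib.dump(search.best_estimator_, os.path.join(args.model_dir, 'model.joblib'))

My understanding is that a script like this could be launched with the SKLearn estimator from the SageMaker Python SDK (sagemaker.sklearn.estimator.SKLearn) pointing at an S3 training channel, but I'm not sure whether that is the right approach, or how SageMaker's own HyperparameterTuner would interact with the scikit-learn search and the feature selection step.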
I'm thinking there must be a way to deploy this job to an AWS service, whether in SageMaker or not, that can fit the entire pipeline, but I'm not sure what that would be or what the best way to do it is.