Working within a SageMaker Jupyter notebook, I have an XGBoost pipeline that transforms my data and also runs some feature selection:
from scipy.stats import randint, uniform
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from xgboost import XGBClassifier

# Scale, select the k best features by mutual information, then classify
steps_xgb = [('scaler', MinMaxScaler()),
             ('feature_reduction', SelectKBest(mutual_info_classif)),
             ('xgb', XGBClassifier(objective='binary:logistic',
                                   use_label_encoder=False,
                                   random_state=1))]
pipeline_xgb = Pipeline(steps_xgb)

# Search space covers the scaler choice, k, and the XGBoost hyperparameters
parameters_xgb = [
    {
        'scaler': [StandardScaler(), MinMaxScaler()],
        'feature_reduction__k': randint(5, 80),
        'xgb__min_child_weight': randint(1, 10),
        'xgb__gamma': uniform(0, 0.5),
        'xgb__subsample': uniform(0.4, 0.6),
        'xgb__colsample_bytree': uniform(0.3, 0.7),
        'xgb__max_depth': randint(2, 7),
        'xgb__n_estimators': randint(100, 200),
        'xgb__learning_rate': uniform(0.03, 0.3)
    }
]

cv_xgb_events = RandomizedSearchCV(pipeline_xgb, param_distributions=parameters_xgb,
                                   cv=5, scoring='roc_auc', verbose=10,
                                   n_iter=200, n_jobs=30)
After defining this, I fit the model by calling:
cv_xgb_events.fit(X_train_events, y_train_events)
This takes a very long time to run; even on a large instance the notebook ends up shutting down before it finishes. I looked into running hyperparameter tuning jobs through SageMaker, but I don't know how to incorporate the pipeline, and specifically the feature selection step. My rough idea so far is sketched below.
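For context, this is roughly what I imagine the entry-point script for a SageMaker training job would look like if the whole search ran inside the job instead of the notebook. The file name train.py, the CSV layout, and the 'target' column name are placeholders I made up, and I haven't verified any of this actually works:

# train.py -- rough sketch of a training-job entry point that fits the whole
# pipeline (file layout and column names below are placeholders)
import argparse
import os

import joblib
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from xgboost import XGBClassifier

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # SageMaker sets these environment variables inside the training container
    parser.add_argument('--train', default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model-dir', default=os.environ.get('SM_MODEL_DIR'))
    args = parser.parse_args()

    # Assumes one CSV in the train channel with the label in a 'target' column (placeholder)
    data = pd.read_csv(os.path.join(args.train, 'train.csv'))
    y = data['target']
    X = data.drop(columns=['target'])

    # Same pipeline and search space as in the notebook
    pipeline_xgb = Pipeline([
        ('scaler', MinMaxScaler()),
        ('feature_reduction', SelectKBest(mutual_info_classif)),
        ('xgb', XGBClassifier(objective='binary:logistic',
                              use_label_encoder=False,
                              random_state=1))])
    parameters_xgb = {
        'scaler': [StandardScaler(), MinMaxScaler()],
        'feature_reduction__k': randint(5, 80),
        'xgb__min_child_weight': randint(1, 10),
        'xgb__gamma': uniform(0, 0.5),
        'xgb__subsample': uniform(0.4, 0.6),
        'xgb__colsample_bytree': uniform(0.3, 0.7),
        'xgb__max_depth': randint(2, 7),
        'xgb__n_estimators': randint(100, 200),
        'xgb__learning_rate': uniform(0.03, 0.3)}

    search = RandomizedSearchCV(pipeline_xgb, param_distributions=parameters_xgb,
                                cv=5, scoring='roc_auc', n_iter=200, n_jobs=-1)
    search.fit(X, y)

    # Save the best fitted pipeline so it can be deployed or reused later
    joblib.dump(search.best_estimator_, os.path.join(args.model_dir, 'model.joblib'))

My understanding is that a script like this could be launched with the SKLearn estimator from the SageMaker Python SDK (sagemaker.sklearn.estimator.SKLearn) pointing at an S3 training channel, but I'm not sure whether that is the right approach, or how SageMaker's own HyperparameterTuner would interact with the scikit-learn search and the feature selection step.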
I'm thinking there must be a way to deploy this job to an AWS service, whether in SageMaker or not, that can fit the entire pipeline, but I'm not sure what that would be or what the best way to do it is.