I am building a machine learning pipeline for time series data, where the goal is to retrain and update the model frequently to make predictions.

  • I have written preprocessing code that handles the time series variables and transforms them.

I am confused about how to use the same preprocessing code for both training and inference. Should I write an AWS Lambda function to preprocess my data, or is there another way?

Sources looked into:

The two examples given by the AWS SageMaker team use AWS Glue to do the ETL transform:

inference_pipeline_sparkml_xgboost_abalone

inference_pipeline_sparkml_blazingtext_dbpedia

I am new to AWS SageMaker and trying to learn, understand, and build the flow. Any help is appreciated!

Sandy

2 Answers


Answering the problems in reverse order.

In your example, the piece of code below creates the inference pipeline, where two models are chained together. Here we need to remove sparkml_model and put our SKLearn model in its place.

sm_model = PipelineModel(name=model_name, role=role, models=[sparkml_model, xgb_model])

Before placing the SKLearn model in the pipeline, we need the SageMaker version of the SKLearn model.

First, create the SKLearn estimator using the SageMaker Python SDK:

from sagemaker.sklearn.estimator import SKLearn

# entry_point is the script holding the preprocessing/transformation logic
sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session)

script_path - this is the Python script that contains all the preprocessing or transformation logic; it is 'sklearn_abalone_featurizer.py' in the link given below.
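For orientation, here is a minimal sketch of what such an entry-point script could look like (the transformation logic and file names are placeholders; the real sklearn_abalone_featurizer.py in the linked example is more elaborate):

import argparse
import os

import pandas as pd
from sklearn.externals import joblib  # plain `import joblib` on newer sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # SageMaker exposes the data and model locations as environment variables
    parser.add_argument('--train', default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--model-dir', default=os.environ['SM_MODEL_DIR'])
    args = parser.parse_args()

    raw = pd.read_csv(os.path.join(args.train, 'train.csv'))  # placeholder file name

    # placeholder transformation logic; your time series feature
    # engineering goes here
    preprocessor = Pipeline([('scaler', StandardScaler())])
    preprocessor.fit(raw)

    # persist the fitted transformer so the exact same preprocessing
    # is available at inference time
    joblib.dump(preprocessor, os.path.join(args.model_dir, 'model.joblib'))

def model_fn(model_dir):
    # SageMaker calls this to load the fitted transformer at serving time
    return joblib.load(os.path.join(model_dir, 'model.joblib'))

def predict_fn(input_data, model):
    # for a featurizer, "predict" means applying the transform
    return model.transform(input_data)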

Train the SKLearn estimator:

sklearn_preprocessor.fit({'train': train_input})
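Here train_input is just the S3 location of the training data; assuming the data is uploaded with the session helper, it could be defined along these lines (bucket and prefix are placeholders):

# hypothetical upload of the raw training data to S3
train_input = sagemaker_session.upload_data(
    path='data/train.csv',
    bucket='my-bucket',
    key_prefix='sagemaker/preprocessing/train')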

Create the SageMaker model from the SKLearn estimator, which can be put in the inference pipeline:

sklearn_inference_model = sklearn_preprocessor.create_model()

The creation of the inference PipelineModel is then modified as shown below:

sm_model = PipelineModel(name=model_name, role=role, models=[sklearn_inference_model, xgb_model])
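Deploying this PipelineModel gives a single endpoint that runs both containers in sequence, so raw data goes in and predictions come out; a sketch (endpoint name, instance type, and the sample row are illustrative):

from sagemaker.predictor import RealTimePredictor, csv_serializer

# one endpoint, two containers invoked in order (featurizer, then XGBoost)
sm_model.deploy(initial_instance_count=1,
                instance_type='ml.c4.xlarge',
                endpoint_name='inference-pipeline-endpoint')

predictor = RealTimePredictor(
    endpoint='inference-pipeline-endpoint',
    sagemaker_session=sagemaker_session,
    serializer=csv_serializer,
    content_type='text/csv')

# raw, untransformed features go in; the SKLearn step preprocesses them first
print(predictor.predict('M, 0.44, 0.365, 0.125, 0.516, 0.2155, 0.114, 0.155'))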

For more details, refer to the link below:

https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.ipynb

solver149

I am using a pipeline in the Python script that serves as my entry point. The first step in this pipeline is the preprocessing. The pipeline is saved as a model, so the model endpoint ultimately includes the preprocessing as well. Details below (I am using scikit-learn, but it should be similar for TensorFlow).

If you call your training, for example, like this:

from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role='xxx',
    train_instance_count=1,
    train_instance_type='ml.c5.xlarge',
    framework_version='0.20.0',
    hyperparameters={'cross-validation': 5,
                     'scoring': 'accuracy'})

then you have an entry-point script. In this script ('script.py') you can have several steps that become part of the model that is finally saved. For example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

....

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

You need to save your model after training, at the end of the script, via joblib.dump. This stored model is used to create the SageMaker model and the model endpoint. When I finally call predictor.predict(X_test), the first step of the pipeline (my preprocessing) is also executed and applied to X_test.
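The end of script.py could then look roughly like this (a sketch; names are illustrative, and the key point is that the whole fitted pipeline, preprocessing included, is the artifact that gets dumped):

import os
from sklearn.externals import joblib  # plain `import joblib` on newer sklearn

lr_tfidf.fit(X_train, y_train)

# persist the entire pipeline: vectorizer and classifier in one artifact
joblib.dump(lr_tfidf, os.path.join(os.environ['SM_MODEL_DIR'], 'model.joblib'))

def model_fn(model_dir):
    # SageMaker calls this at endpoint startup to load the pipeline,
    # so predictor.predict() runs the preprocessing step too
    return joblib.load(os.path.join(model_dir, 'model.joblib'))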

SageMaker supports different ways of preprocessing; I just wanted to share a rather simple one that works fine for my scenario. By the way, in script.py I am using a grid search over the parameters of the pipeline steps, as sketched below.
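A sketch of how that grid search might look, continuing from the pipeline above (the parameter grid values are placeholders; cv and scoring would come from the hyperparameters parsed with argparse):

from sklearn.model_selection import GridSearchCV

# step names from the Pipeline ('vect', 'clf') prefix the parameter names
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)],  # placeholder values
              'clf__C': [0.1, 1.0, 10.0]}

gs = GridSearchCV(lr_tfidf, param_grid,
                  scoring='accuracy',  # from the 'scoring' hyperparameter
                  cv=5)                # from the 'cross-validation' hyperparameter
gs.fit(X_train, y_train)

# the best fitted pipeline is what gets saved with joblib.dump
best_model = gs.best_estimator_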

Andi Schroff