I am using a scikit-learn Pipeline in the Python script that serves as the entry point. The first step of this pipeline is the preprocessing. The whole pipeline is saved as the model, so the model endpoint ultimately includes the preprocessing as well. Details (I am using scikit-learn, but it should be similar for TensorFlow):
If you call your training, for example, like this:
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role='xxx',
    train_instance_count=1,
    train_instance_type='ml.c5.xlarge',
    framework_version='0.20.0',
    hyperparameters={'cross-validation': 5,
                     'scoring': 'accuracy'})
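SageMaker hands the hyperparameters dictionary to script.py as command-line arguments, so the script usually reads them with argparse. A minimal sketch of that part of script.py, assuming a training channel named 'train' (the SM_* environment variables are set by SageMaker inside the training container):

import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters from the SKLearn estimator arrive as CLI arguments.
    parser.add_argument('--cross-validation', type=int, default=5)
    parser.add_argument('--scoring', type=str, default='accuracy')

    # Directories provided by the SageMaker training container.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN'))

    args = parser.parse_args()  # gives args.cross_validation, args.scoring, ...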
Then you have an entry point script. In this script ('script.py') you can have several steps that become part of the model that is finally saved. For example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
# ...
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])
You need to save your model at the end of the script, after training, via joblib.dump. This stored model is used to create the SageMaker model and the model endpoint. When I finally call predictor.predict(X_test), the first step of the pipeline (my preprocessing) is also executed and applied to X_test.
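Concretely, the end of script.py could look like the sketch below: the fitted pipeline is dumped into the directory that SageMaker uploads as the model artifact, and model_fn is the hook the scikit-learn serving container calls to load it again (the file name model.joblib is my choice, not something SageMaker prescribes):

import os
import joblib  # with scikit-learn 0.20 this lived at sklearn.externals.joblib

def model_fn(model_dir):
    # Called by the SageMaker scikit-learn serving container to load the model.
    return joblib.load(os.path.join(model_dir, 'model.joblib'))

if __name__ == '__main__':
    # ... parse the arguments, load the data, fit lr_tfidf (or the grid
    # search shown further below) ...
    joblib.dump(lr_tfidf, os.path.join(args.model_dir, 'model.joblib'))

Because the saved object is the whole Pipeline, the endpoint runs the TF-IDF step before the classifier on every predict call.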
SageMaker supports different ways of preprocessing; I just wanted to share a rather simple one that works fine for my scenario. By the way, I am also using a GridSearchCV over the parameters of the pipeline steps in script.py.
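For completeness, a grid search over a pipeline addresses step parameters with the step-name__parameter syntax. A sketch, assuming X_train/y_train were loaded from the training channel and reusing the CLI arguments from above; the grid values are purely illustrative:

from sklearn.model_selection import GridSearchCV

# The 'vect__' and 'clf__' prefixes route parameters to the pipeline steps.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}
gs = GridSearchCV(lr_tfidf, param_grid,
                  cv=args.cross_validation,
                  scoring=args.scoring)
gs.fit(X_train, y_train)

# Persist the refitted best pipeline so the endpoint serves the tuned model.
joblib.dump(gs.best_estimator_, os.path.join(args.model_dir, 'model.joblib'))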