
I have a notebook on SageMaker that I would like to run every night. What's the best way to schedule this task? Is there a way to run a bash script and schedule a cron job from SageMaker?

VicariousAT

5 Answers


Amazon SageMaker is a set of APIs that support various machine learning and data science tasks. These APIs can be invoked from several sources, such as the CLI, the SDKs, or scheduled AWS Lambda functions (see the documentation: https://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html).

The main parts of Amazon SageMaker are notebook instances, training and tuning jobs, and model hosting for real-time predictions. Each can benefit from a different kind of schedule. The most popular are:

  • Stopping and Starting Notebook Instances - Since notebook instances are used for interactive ML model development, you don't really need them running overnight or on weekends. You can schedule a Lambda function to call the stop-notebook-instance API at the end of the working day (8PM, for example) and the start-notebook-instance API in the morning. Please note that you can also run crontab on the notebook instances (after opening the local terminal from the Jupyter interface).
  • Refreshing an ML Model - Automating the retraining of models on new data that is constantly flowing into the system is a common need, and SageMaker makes it easier to address. Calling the create-training-job API from a scheduled Lambda function (or even from a CloudWatch Event that monitors the performance of your existing models), pointing to the S3 bucket where the old and new data reside, produces a refreshed model that you can then deploy into an A/B testing environment.

----- UPDATE (thanks to @snat2100 comment) -----

  • Creating and Deleting Real-time Endpoints - If your real-time endpoints are not needed 24/7 (for example, if they serve internal company users during working days and hours), you can create the endpoints in the morning and delete them at night.
Guy
  • Hello Guy, do you think we can do the same thing on Model Endpoints (creation and deletion)? I am exposing an endpoint on a Webapp using ApiGateway and I want to use it only for a specific hour range. – HazimoRa3d Oct 04 '19 at 13:30
  • Sure @snat2100. If you don't need your endpoint all the time, deleting it and creating it again the next day will save you a lot of money. You can also consider using other services such as Fargate to host your Docker images. Please remember that it takes a few minutes for the endpoint to be created. – Guy Oct 04 '19 at 13:49

As of mid-2020, AWS provides several options to run a notebook as a cron job. They use Papermill to inject parameters per run, and you can also use the CLI to run the notebook on demand.
You can: (1) use the AWS APIs or CLI directly; (2) use a convenience package; or (3) use a JupyterLab extension.

See this tutorial and the Quick Start guide for examples.
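For the Papermill route, a parameterized run might look like this sketch; the file names and the `run_date` parameter are assumptions, and the notebook needs a cell tagged `parameters` for the injection to work:

```python
def run_nightly(run_date: str) -> None:
    """Execute a notebook headlessly, injecting a parameter per run.

    Sketch only: input/output paths and the `run_date` parameter are
    assumptions about your setup.
    """
    import papermill as pm  # imported here; papermill must be installed

    pm.execute_notebook(
        "nightly.ipynb",                     # source notebook
        f"output/nightly-{run_date}.ipynb",  # executed copy kept per run
        parameters={"run_date": run_date},   # overrides the tagged cell
    )
```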

Omri

I don't think there is a built-in way to schedule tasks on SageMaker. The notebook is meant more for interacting with the SageMaker runtime, which is mainly for training and hosting ML models.

I presume you want to retrain your model every night. There are two ways of achieving that: retrain your model somewhere else, then upload it to S3 and recreate your Docker container every night using an external script; or provide your own Docker container with a cron job scheduled inside it, and give that to SageMaker to deploy.

Raman
  • This is essentially what I would like to do. I have a model. Each day I want to spin up a machine with the docker image and process the days worth of new data to predict. There is no documentation on this. – Keith Jul 24 '18 at 20:53
  • @Kieth, there is no documentation on that because SageMaker is more for hosting and training your models, not automating a workflow. I would just add a Python script that retrains the model on the new data, runs some tests, and then uploads it to S3 with a new version number, and set SageMaker to take the latest version. Every time a new version of the model is uploaded, SageMaker would finish serving the current requests and load the new model to serve the rest. Hopefully that helps. – Raman Jul 25 '18 at 16:28
  • I am not sure what you mean by automating a workflow. I want to host a trained model for batch prediction. For obvious reasons (time, resources, stability) retraining each time is not a good idea. This is the most common deployment model and it is not supported. – Keith Jul 25 '18 at 22:50
  • +1 to Keith. If you need to schedule a training task like this, it's probably better to use a scheduled task in EC2 or ECS. The point of SM, as mentioned before, is interaction on top of Jupyter in a scalable way. If you need to productionize, it's far better to set up a repo with the source code, set up Docker/Kubernetes, and put it in an orchestrator: you will not only have a better way to schedule it, but also save tons of computational resources, get proper code version control (Jupyter notebooks are hard to review) and debuggability, and you can apply proper CI/CD. – Flavio Jul 24 '19 at 10:30

You have access to the notebook terminal from the AWS console Jupyter page (in the upper right corner, select New --> Terminal). If cron is enough for you, maybe crontab there will suffice.
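As a sketch of that approach (the paths and schedule are assumptions), you could drop a small runner script on the instance and register it from the terminal with `crontab -e`, e.g. a line like `0 2 * * * python /home/ec2-user/SageMaker/run_nightly.py`:

```python
import subprocess

# run_nightly.py (hypothetical name): executes every cell of the notebook
# in place via jupyter nbconvert; scheduled from crontab on the instance.
NOTEBOOK = "/home/ec2-user/SageMaker/nightly.ipynb"  # assumption: your path

def execute_notebook(path: str) -> int:
    # --execute runs all cells; --inplace writes the results back to the file
    return subprocess.call(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--inplace", path]
    )
```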

If you have big, expensive jobs that can be run in a container, consider also AWS Batch. There you can, for example, try to use spot pricing for the needed instances. Batch jobs can be initiated by CloudWatch Events (e.g. a cron trigger).

Lauri Laanti

The task is now simplified a lot by means of two services. One is Step Functions, which allows you to create workflows through connectors to multiple AWS services. As an example, a simple pipeline could be started by starting a crawler, then a Glue job, and finally a SageMaker notebook. To schedule this process, EventBridge is used as the cron for the task. Note that Step Functions is highly parameterizable.
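The pipeline described above (crawler, then Glue job, then notebook step) can be sketched as a Step Functions state machine definition, which you would then trigger with an EventBridge cron rule. All resource names below are placeholders, and the exact service-integration ARNs should be checked against the Step Functions docs:

```python
import json

# Sketch of an Amazon States Language definition for the pipeline above.
# Crawler/job/notebook names are placeholders.
definition = {
    "StartAt": "StartCrawler",
    "States": {
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "my-crawler"},
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            # .sync waits for the Glue job to finish before moving on
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "my-etl-job"},
            "Next": "StartNotebook",
        },
        "StartNotebook": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:sagemaker:startNotebookInstance",
            "Parameters": {"NotebookInstanceName": "my-notebook"},
            "End": True,
        },
    },
}

# The JSON form is what you would paste into the Step Functions console
# (and then schedule with an EventBridge cron rule).
asl_json = json.dumps(definition, indent=2)
```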