The AWS blog post "Managed Spot Training: Save Up to 90% On Your Amazon SageMaker Training Jobs" says:
Setting it up is extremely simple, as it should be when working with a fully-managed service:
- If you’re using the console, just switch the feature on.
- If you’re working with the Amazon SageMaker SDK, just set the train_use_spot_instances to true in the Estimator constructor.
The SageMaker SDK documentation for sagemaker.estimator.Estimator says:
- use_spot_instances (bool) – Specifies whether to use SageMaker Managed Spot instances for training. If enabled then the max_wait arg should also be set.
- max_wait (int) – Timeout in seconds waiting for spot training instances (default: None). After this amount of time Amazon SageMaker will stop waiting for Spot instances to become available (default: None).
Following the documentation, I ran the code below.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="fashion_mnist_training.py",
    source_dir="src",
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    role=role,
    input_mode="File",
    framework_version="2.3.1",
    py_version="py37",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,
    max_wait=23 * 60 * 60,
    base_job_name=base_job_name,
    checkpoint_s3_uri=checkpoint_s3_uri,
    model_dir=False,  # To avoid duplicate 'model_dir' command line argument
)
However, this raises the following error:
ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Invalid MaxWaitTimeInSeconds. It must be present and be greater than or equal to MaxRuntimeInSeconds
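To make the rule in the error message concrete, here is a minimal sketch of the comparison it describes. It does not call SageMaker at all. The helper name spot_timeouts_are_valid and the constant ASSUMED_DEFAULT_MAX_RUN are hypothetical, and the 24-hour value is my assumption about what the SDK uses for max_run when it is not passed (which is the case in the code above), so it should be checked against the installed SDK version.

```python
from typing import Optional

# Assumption: the estimator falls back to a 24-hour max_run when none is given.
# Verify this against the documentation of the SDK version in use.
ASSUMED_DEFAULT_MAX_RUN = 24 * 60 * 60


def spot_timeouts_are_valid(max_wait: int, max_run: Optional[int] = None) -> bool:
    """Hypothetical helper mirroring the rule stated in the error message:
    MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds."""
    effective_max_run = ASSUMED_DEFAULT_MAX_RUN if max_run is None else max_run
    return max_wait >= effective_max_run


# max_wait from the code above, with max_run left unset
print(spot_timeouts_are_valid(23 * 60 * 60))
# same max_wait, with an explicit, shorter max_run
print(spot_timeouts_are_valid(23 * 60 * 60, max_run=20 * 60 * 60))
```

If that assumption holds, the 23-hour max_wait is shorter than the effective max_run, which would be consistent with the ValidationException.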