Managed Spot Training: Save Up to 90% On Your Amazon SageMaker Training Jobs says:

Setting it up is extremely simple, as it should be when working with a fully-managed service:

  • If you’re using the console, just switch the feature on.
  • If you’re working with the Amazon SageMaker SDK, just set the train_use_spot_instances to true in the Estimator constructor.

SageMaker SDK sagemaker.estimator.Estimator says:

  • use_spot_instances (bool) – Specifies whether to use SageMaker Managed Spot instances for training. If enabled then the max_wait arg should also be set.
  • max_wait (int) – Timeout in seconds waiting for spot training instances (default: None). After this amount of time Amazon SageMaker will stop waiting for Spot instances to become available (default: None).

Following the documentation, I ran the code below.

from sagemaker.tensorflow import TensorFlow


estimator = TensorFlow(
    entry_point="fashion_mnist_training.py",
    source_dir="src",
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    role=role,
    input_mode='File',
    framework_version="2.3.1",
    py_version="py37",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,
    max_wait= 23 * 60 * 60, 
    base_job_name=base_job_name,
    checkpoint_s3_uri=checkpoint_s3_uri,
    model_dir=False  # To avoid duplicate 'model_dir' command line argument
)

However, this raises the following error:

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Invalid MaxWaitTimeInSeconds. It must be present and be greater than or equal to MaxRuntimeInSeconds
mon

2 Answers

This is another case of incorrect AWS SageMaker documentation. Contrary to what the blog post says, just setting train_use_spot_instances to true in the Estimator constructor is not enough.

Managed Spot Training: Save Up to 90% On Your Amazon SageMaker Training Jobs:

Setting it up is extremely simple, as it should be when working with a fully-managed service:

  • If you’re using the console, just switch the feature on.
  • If you’re working with the Amazon SageMaker SDK, just set the train_use_spot_instances to true in the Estimator constructor.

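MaxWaitTimeInSeconds must be equal to or greater than MaxRuntimeInSeconds. In the SageMaker SDK these correspond to max_wait and max_run. Because max_run defaults to 24 hours (24 * 60 * 60 seconds) when it is not set, a max_wait of 23 hours is smaller than the default max_run and the request is rejected.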

SageMaker API StoppingCondition

MaxRuntimeInSeconds

The maximum length of time, in seconds, that a training or compilation job can run.

MaxWaitTimeInSeconds

The maximum length of time, in seconds, that a managed Spot training job has to complete. It is the amount of time spent waiting for Spot capacity plus the amount of time the job can run. It must be equal to or greater than MaxRuntimeInSeconds. If the job does not complete during this time, Amazon SageMaker ends the job.
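The rule above can be sketched as a small local check. This is only an illustration: `check_spot_timeouts` is a hypothetical helper, not part of the SageMaker SDK, and the 24-hour default for `max_run` is taken from the SDK's documented default.

```python
# Documented default for `max_run` in the SageMaker Python SDK (assumption:
# the installed SDK version still uses this default).
DEFAULT_MAX_RUN_SECONDS = 24 * 60 * 60


def check_spot_timeouts(max_wait, max_run=DEFAULT_MAX_RUN_SECONDS):
    """Hypothetical helper mirroring the service-side rule max_wait >= max_run."""
    if max_wait is None:
        raise ValueError("max_wait must be set when use_spot_instances=True")
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}s) must be >= max_run ({max_run}s)"
        )
    return True
```

With `max_wait` set to 23 hours and `max_run` left at its default, this check raises `ValueError`, which is the same condition the CreateTrainingJob call rejects.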

Fix

from sagemaker.tensorflow import TensorFlow


estimator = TensorFlow(
    entry_point="fashion_mnist_training.py",
    source_dir="src",
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    role=role,
    input_mode='File',
    framework_version="2.3.1",
    py_version="py37",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,
    max_run=23 * 60 * 60,   # maximum training time in seconds
    max_wait=24 * 60 * 60,  # must be equal to or greater than max_run
    base_job_name=base_job_name,
    checkpoint_s3_uri=checkpoint_s3_uri,
    model_dir=False 
)
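One way to avoid getting the two timeouts out of sync is to derive max_wait from max_run plus a budget for waiting on Spot capacity. The sketch below assumes a hypothetical helper, `spot_timeouts`, and an arbitrary one-hour waiting budget; neither comes from the SageMaker SDK. Only the keyword names `use_spot_instances`, `max_run` and `max_wait` are real Estimator arguments.

```python
def spot_timeouts(max_run_seconds, spot_wait_budget_seconds=60 * 60):
    """Hypothetical helper: build Estimator keyword arguments for Spot training.

    max_wait covers waiting for Spot capacity plus the run itself, so it is
    computed as max_run plus a waiting budget and can never be smaller.
    """
    if max_run_seconds <= 0 or spot_wait_budget_seconds < 0:
        raise ValueError("timeouts must be positive")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_run_seconds + spot_wait_budget_seconds,
    }


# Example: 23 hours of training plus up to 1 hour of waiting for capacity.
kwargs = spot_timeouts(23 * 60 * 60)
```

The resulting dictionary could then be unpacked into the constructor, for example `TensorFlow(..., **kwargs)`, instead of writing the two values by hand.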

Related

SageMaker Managed Spot Training with Object Detection algorithm

mon

  • `max_wait= 23 * 60 * 60` and `max_run = 24 * 60 * 60` should be the other way around, right? `max_wait` should be greater than `max_run`. – Marek Grzenkowicz May 23 '22 at 08:28

The official AWS Managed Spot Training documentation states the requirement explicitly:

Set EnableManagedSpotTraining to True and specify the MaxWaitTimeInSeconds. MaxWaitTimeInSeconds must be larger than MaxRuntimeInSeconds.
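At the API level, the quoted requirement maps onto two parts of a CreateTrainingJob request. The sketch below builds only those parts. `managed_spot_request_fields` is a hypothetical helper, and the check uses "greater than or equal", following the error message in the question rather than the stricter "larger than" wording above.

```python
def managed_spot_request_fields(max_runtime_seconds, max_wait_seconds):
    """Hypothetical helper: the Spot-related parts of a CreateTrainingJob request."""
    if max_wait_seconds < max_runtime_seconds:
        raise ValueError(
            "MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds"
        )
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
    }


# Example: allow 1 hour of training within a 2-hour overall window.
fields = managed_spot_request_fields(60 * 60, 2 * 60 * 60)
```

These fields would still have to be merged with the other required parts of the request (job name, algorithm specification, role, output and resource configuration) before calling the API.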

Dharman