AWS Sagemaker inference endpoint doesn't scale in with autoscaling

Question

I have an AWS SageMaker inference endpoint with autoscaling enabled, using the SageMakerVariantInvocationsPerInstance target metric. When I send a lot of requests to the endpoint, the number of instances correctly scales out to the maximum instance count. But after I stop sending requests, the number of instances doesn't scale in to 1, the minimum instance count. I waited for many hours. Is there a reason for this behaviour?

Thanks

score 5 · Accepted Answer · answered Dec 18 '20 at 14:09

Auto Scaling needs a CloudWatch alarm to fire before it scales in. SageMaker doesn't push 0-value metrics when there's no activity (it just doesn't push anything). This puts the alarm into the INSUFFICIENT_DATA state, so it never triggers the scale-in action when your workload suddenly ends.

Workarounds are either:

Have a step scaling policy that uses the CloudWatch metric math FILL() function for your scale-in alarm. This way you can tell CloudWatch "if there's no data, pretend this was the metric value" when evaluating the alarm. This is only possible with step scaling, since target tracking creates the alarms for you (and Auto Scaling periodically recreates them, so any manual changes get deleted).
Have scheduled scaling set the size back down to 1 every evening
Make sure traffic continues at a low level for some time
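
As a rough illustration of the first workaround, here is a minimal Python sketch of a step scaling policy plus a CloudWatch alarm on FILL(m1, 0). It assumes the variant is already registered as a scalable target. The helper names, the alarm and policy names, the InvocationsPerInstance metric with the Sum statistic, the threshold of 1, and the 15 one-minute evaluation periods are all illustrative assumptions, not values from the question. The builder functions only assemble request parameters; apply_scale_in() is the part that actually calls AWS through boto3.

```python
from typing import Any, Dict


def build_step_scale_in_policy(endpoint_name: str, variant_name: str,
                               min_capacity: int = 1) -> Dict[str, Any]:
    """Parameters for application-autoscaling put_scaling_policy (step scaling)."""
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-scale-in",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "StepScaling",
        "StepScalingPolicyConfiguration": {
            # Jump straight to the minimum instance count when the alarm fires.
            "AdjustmentType": "ExactCapacity",
            "StepAdjustments": [
                {"MetricIntervalUpperBound": 0.0, "ScalingAdjustment": min_capacity}
            ],
            "Cooldown": 300,  # assumed cooldown in seconds
        },
    }


def build_scale_in_alarm(endpoint_name: str, variant_name: str, policy_arn: str,
                         threshold: float = 1.0, period: int = 60,
                         evaluation_periods: int = 15) -> Dict[str, Any]:
    """Parameters for cloudwatch put_metric_alarm using metric math."""
    return {
        "AlarmName": f"{endpoint_name}-{variant_name}-scale-in",  # hypothetical name
        "ComparisonOperator": "LessThanThreshold",
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "AlarmActions": [policy_arn],
        "Metrics": [
            {
                # Raw metric: has no datapoints at all while the endpoint is idle.
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/SageMaker",
                        "MetricName": "InvocationsPerInstance",
                        "Dimensions": [
                            {"Name": "EndpointName", "Value": endpoint_name},
                            {"Name": "VariantName", "Value": variant_name},
                        ],
                    },
                    "Period": period,
                    "Stat": "Sum",
                },
                "ReturnData": False,
            },
            {
                # The alarm evaluates this expression: missing datapoints become 0.
                "Id": "e1",
                "Expression": "FILL(m1, 0)",
                "Label": "InvocationsPerInstanceOrZero",
                "ReturnData": True,
            },
        ],
    }


def apply_scale_in(endpoint_name: str, variant_name: str) -> None:
    """Create the policy and the alarm (needs boto3 and AWS credentials)."""
    import boto3

    autoscaling = boto3.client("application-autoscaling")
    cloudwatch = boto3.client("cloudwatch")
    policy = autoscaling.put_scaling_policy(
        **build_step_scale_in_policy(endpoint_name, variant_name)
    )
    cloudwatch.put_metric_alarm(
        **build_scale_in_alarm(endpoint_name, variant_name, policy["PolicyARN"])
    )
```

Something like apply_scale_in("my-endpoint", "AllTraffic") would then create both resources. Because the alarm is one you create yourself rather than one managed by target tracking, Auto Scaling should not overwrite it.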

SageMaker has now started publishing the 0 metric hence the endpoint should scale down to minimum where there is no traffic. This answer could be updated to mention this? — Harish Panwar, Jul 18 '22 at 04:18
