
I'm running Python Celery (a distributed task queue library) workers in an AWS ECS cluster (one Celery worker per EC2 instance), but the tasks are long-running and NOT idempotent. This means that when an autoscaling scale-in event happens, i.e. when ECS terminates one of the containers running a worker because of low task load, the long-running tasks currently in progress on that worker are lost forever.

Does anyone have any suggestions on how to configure ECS autoscaling so that no tasks are terminated before completion? Ideally, an ECS scale-in event would initiate a warm shutdown on the Celery worker on the EC2 instance it wants to terminate, but only ACTUALLY terminate the EC2 instance once the Celery worker has finished the warm shutdown, which happens after all of its tasks have completed.

I also understand there is something called instance protection, which can be set programmatically and protects instances from being terminated during a scale-in event: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-termination.html#instance-protection-instance
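Programmatically, I imagine toggling it would look roughly like this (a boto3 sketch; the IMDS lookup is my assumption about how the worker would find its own instance ID):

```python
# boto3 sketch (assumes the code runs on the EC2 instance itself and the instance
# role allows autoscaling:SetInstanceProtection and DescribeAutoScalingInstances)
import boto3
import requests

def set_scale_in_protection(protected: bool) -> None:
    # instance ID from the EC2 instance metadata service (IMDSv1 shown for brevity)
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).text

    autoscaling = boto3.client("autoscaling")
    # find the Auto Scaling group this instance belongs to
    asg_name = autoscaling.describe_auto_scaling_instances(
        InstanceIds=[instance_id]
    )["AutoScalingInstances"][0]["AutoScalingGroupName"]

    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=protected,
    )
```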

However, I'm not aware of any Celery signals that fire after all in-flight tasks have finished during a warm shutdown, so I'm not sure how I'd know programmatically when to disable the protection anyway. And even if I found a way to disable the protection at the right moment, who would decide which worker gets sent the shutdown signal in the first place? Can EC2 be configured to perform a custom action on instances in a scale-in event (like triggering a warm Celery shutdown) instead of just terminating the EC2 instance?

2 Answers


I think that when ECS scales in, it sends SIGTERM to your task's containers, waits 30 seconds (the default), and then kills them with SIGKILL.

I think you can increase the time between the two signals with this agent variable: ECS_CONTAINER_STOP_TIMEOUT.

That way, your running Celery tasks can finish, and no new tasks will be sent to that Celery worker (a warm shutdown after receiving the SIGTERM).
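If you want to see that window from the Celery side, there are signals around the warm shutdown (a minimal sketch; the handler names and log messages are just illustrative):

```python
# minimal sketch: hooking Celery's shutdown signals to observe the warm shutdown
import logging
from celery.signals import worker_shutting_down, worker_shutdown

logger = logging.getLogger(__name__)

@worker_shutting_down.connect
def on_warm_shutdown_start(sig=None, how=None, exitcode=None, **kwargs):
    # fires when the worker receives the shutdown signal (e.g. SIGTERM)
    # and stops consuming new tasks
    logger.info("Warm shutdown started: sig=%s how=%s", sig, how)

@worker_shutdown.connect
def on_worker_exit(**kwargs):
    # fires just before the worker process exits, i.e. after the warm
    # shutdown has let in-flight tasks finish
    logger.info("Warm shutdown complete, worker exiting")
```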

This answer might help you: https://stackoverflow.com/a/49564080/1011253

ItayB
  • Thanks so much! That is a great suggestion. A new issue I'm having is that our deployment process (where we update the code for our workers) seems to be blocked until the worker containers fully die, which can take a really long time if they have long jobs to finish out. We have 50+ apps running on the same cluster, so having one instance hold up task placement can stop any of them from updating their task definition. Would you happen to have any ideas? We aren't sure what to do; AWS support said to use AWS Batch, but that won't be great for small jobs that need to start and finish quickly. – Dominic Napoleon Apr 06 '21 at 20:04
  • I'm looking for a solution where we can place new workers to consume new messages with updated code, but where the old workers can finish themselves out with a long ECS_CONTAINER_STOP_TIMEOUT and not block other task placements. – Dominic Napoleon Apr 06 '21 at 20:07
  • I guess you can do that if your auto-scaling is based on queue depth. It seems to be possible in ECS (https://aws.amazon.com/blogs/containers/autoscaling-amazon-ecs-services-based-on-custom-cloudwatch-and-prometheus-metrics/). I did it in Kubernetes; you can read about it here: https://itay-bittan.medium.com/hpa-for-celery-workers-6efd82444aee – ItayB Apr 07 '21 at 06:03
  • Did you come up with a workable solution, @DominicNapoleon? – Keir Whitlock Jun 08 '21 at 15:08
  • @KeirWhitlock not really. We were thinking about using AWS Batch but then decided against it, since we need the jobs to execute immediately, with no waiting – Dominic Napoleon Sep 13 '21 at 17:39

What we do in our company is not use ECS at all, just "plain" EC2 (for this particular service). We have an "autoscaling" task that runs every N minutes and, depending on the situation, scales the cluster up by M new machines (all configurable via AWS Parameter Store), so Celery basically scales itself up and down. That task also sends a shutdown signal to every worker older than 10 minutes that is completely idle. When a Celery worker shuts down, the whole machine terminates (in fact, the Celery worker shuts it down via a @worker_shutdown.connect handler that powers off the machine; all these EC2 instances have the "terminate" shutdown behaviour). The cluster processes millions of tasks per day, some of them running for up to 12 hours...
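Roughly, the handler pattern looks like this (a sketch, not our exact code; the poweroff command and handler name are assumptions):

```python
# sketch of a worker_shutdown handler that powers off the machine once the
# warm shutdown has drained all in-flight tasks; with the instance's shutdown
# behaviour set to "terminate", powering off the OS makes EC2 terminate it
import subprocess
from celery.signals import worker_shutdown

@worker_shutdown.connect
def power_off_machine(**kwargs):
    subprocess.run(["sudo", "poweroff"], check=False)
```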

DejanLekic