I've looked through multiple SO questions about this but still haven't found a solution, and I've been working on it for several days now, so any help would be appreciated!
I'm using Celery 4.4.0, SQS as the broker, and Django 2.2. Most of my tasks are pretty long (1 to 4 hours).
This is the command to start the workers:
celery worker -A config.celeryconfig:app -Ofair --prefetch-multiplier=1 --max-tasks-per-child=2
And this is the configuration in my Django settings file:
CELERYD_PREFETCH_MULTIPLIER = 1
CELERY_ACKS_LATE = True
task_acks_late = True  # I wasn't sure which name the late-ack setting uses, so I set both.
BROKER_TRANSPORT_OPTIONS = {
'polling_interval': 3,
'region': 'us-east-1',
    'visibility_timeout': 3600,  # 1 hour
}
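For reference, here is a minimal sketch of the same options set with Celery 4's lowercase setting names directly on the app object, which sidesteps the old-vs-new naming ambiguity above (the module path matches my worker command; the app name and broker URL form are just for illustration):

# config/celeryconfig.py - same configuration via Celery 4's lowercase names
from celery import Celery

app = Celery('app')  # app name is illustrative

app.conf.update(
    broker_url='sqs://',               # AWS credentials taken from the environment
    worker_prefetch_multiplier=1,      # new name for CELERYD_PREFETCH_MULTIPLIER
    task_acks_late=True,               # new name for CELERY_ACKS_LATE
    broker_transport_options={
        'polling_interval': 3,
        'region': 'us-east-1',
        'visibility_timeout': 3600,    # 1 hour
    },
)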
Scenario - let's say we have two workers (1 and 2), each with one sub-process, and three tasks: a (longer than 1 hour), b, and c. I dispatched them all at around the same time:
worker 1 picked up task a
worker 1 picked up task c (but not executing)
worker 2 picked up task b
worker 2 finished with task b
worker 2 picked up task c (because of the visibility_timeout)
worker 2 finished up task c
worker 1 finished with task a
worker 1 started with task c
worker 1 finished with task c
So:
- Workers are still prefetching tasks.
- (much worse) The same task can be executed twice, or even more times if there are more workers and the execution time is longer than the visibility_timeout. It's discussed in detail here: https://github.com/celery/celery/issues/4400.
I realize the solution to the above problems would be to disable the prefetching behavior, but so far I haven't been able to achieve that.
I'm so frustrated, so if you have any idea how to solve this - please let me know!
A few notes:
* I saw this in my production environment as well, where we have more than one sub-process per worker.
* I used to dispatch tasks using countdown (which resulted in ETA tasks), but I've disabled it and the issue still persists.
* I'm not using a result_backend; I'm handling everything on a model in the DB that I've created.
* I don't want to set concurrency=1, because I want to have one worker per machine, with the number of sub-processes equal to the machine's CPU cores.
* I've tried many combinations of the Celery configurations mentioned above, but none worked (the one listed above is based on this comment - https://stackoverflow.com/a/58958823).
* Another possible solution, which I'd prefer not to use, is increasing the visibility_timeout to a very big number. But this could result in one worker consuming all the long tasks one after the other, with no distribution of them between all the workers.
* Not sure if it's related - I'm deploying Celery using ElasticBeanstalk on an EC2 machine.
* Another possible solution that I'm currently considering is checking the status of the task (we use Pending/In Progress, etc. statuses) and only continuing if it's Pending, as sketched below. (This might not cover every case because of a possible race condition, but it should solve most of them.)
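Here's a rough sketch of that last idea, assuming a hypothetical TaskRecord model with a status field (mine is similar); select_for_update() inside a transaction narrows the race window between duplicate deliveries, though it doesn't eliminate it completely:

from celery import shared_task
from django.db import transaction

from myapp.models import TaskRecord  # hypothetical model holding the Pending/In Progress statuses


@shared_task(bind=True)
def long_task(self, record_id):
    # Claim the record atomically, so a duplicate delivery sees
    # 'In Progress' and bails out instead of re-running the work.
    with transaction.atomic():
        record = TaskRecord.objects.select_for_update().get(pk=record_id)
        if record.status != 'Pending':
            return  # another worker already claimed (or finished) this task
        record.status = 'In Progress'
        record.save(update_fields=['status'])
    # ... the actual long-running work happens here, outside the row lock ...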