
What are the implications of disabling gossip, mingle, and heartbeat on my celery workers?

In order to reduce the number of messages sent to CloudAMQP and stay within the free plan, I decided to follow these recommendations and used the options --without-gossip --without-mingle --without-heartbeat. Since then, I have been using these options by default for all my celery projects, but I am not sure whether there are any side effects I am not aware of.
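For reference, the options in question are passed when the worker is started, something like this (the app module, queue name, and log level are placeholders):

```shell
celery -A proj worker \
    --without-gossip \
    --without-mingle \
    --without-heartbeat \
    -Q default --loglevel=info
```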

Please note:

  • we have since moved to a Redis broker and no longer have such tight limitations on the number of messages sent to the broker
  • we have several instances running multiple celery workers with multiple queues
nbeuchat

4 Answers


This is the base documentation which doesn't give us much info

heartbeat

Is related to communication between the worker and the broker (in your case the broker is CloudAMQP). See explanation

With --without-heartbeat, the worker won't send heartbeat events

mingle

It only asks for "logical clocks" and "revoked tasks" from other workers on startup.

Taken from whatsnew-3.1

The worker will now attempt to synchronize with other workers in the same cluster.

Synchronized data currently includes revoked tasks and logical clock.

This only happens at startup and causes a one second startup delay to collect broadcast responses from other workers.

You can disable this bootstep using the --without-mingle argument.

Also see docs
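To make the "logical clocks and revoked tasks" exchange concrete, here is a toy plain-Python model (not Celery's actual code): at startup, a mingling worker adopts the highest Lamport-style clock among its peers and the union of their revoked task ids.

```python
class Worker:
    """Minimal stand-in for a worker's mingle-relevant state (hypothetical)."""

    def __init__(self, clock=0, revoked=()):
        self.clock = clock           # Lamport-style logical clock
        self.revoked = set(revoked)  # ids of tasks revoked while this worker ran

    def mingle(self, peers):
        """On startup: adopt the max peer clock and the union of revoked ids."""
        for peer in peers:
            self.clock = max(self.clock, peer.clock)
            self.revoked |= peer.revoked

old1 = Worker(clock=41, revoked={"task-a"})
old2 = Worker(clock=57, revoked={"task-b"})
fresh = Worker()
fresh.mingle([old1, old2])
print(fresh.clock, sorted(fresh.revoked))  # 57 ['task-a', 'task-b']
```

With --without-mingle this synchronization step is simply skipped, so `fresh` would keep clock 0 and an empty revoked set.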

gossip

Workers send events to all other workers and this is currently used for "clock synchronization", but it's also possible to write your own handlers on events, such as on_node_join. See docs

Taken from whatsnew-3.1

Workers are now passively subscribing to worker related events like heartbeats.

This means that a worker knows what other workers are doing and can detect if they go offline. Currently this is only used for clock synchronization, but there are many possibilities for future additions and you can write extensions that take advantage of this already.

Some ideas include consensus protocols, reroute task to best worker (based on resource usage or data locality) or restarting workers when they crash.

We believe that although this is a small addition, it opens amazing possibilities.

You can disable this bootstep using the --without-gossip argument.
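The passive heartbeat subscription that gossip enables can be pictured with a small plain-Python simulation (not Celery internals): each worker tracks the last heartbeat it saw from each peer and treats a peer as offline after a few missed intervals.

```python
HEARTBEAT_INTERVAL = 2.0  # seconds between heartbeats (illustrative value)
MISSED_ALLOWED = 3        # peer counts as "offline" after this many missed beats

class GossipView:
    """One worker's passive view of its peers (hypothetical sketch)."""

    def __init__(self):
        self.last_seen = {}  # peer name -> timestamp of last heartbeat seen

    def on_heartbeat(self, peer, now):
        self.last_seen[peer] = now

    def alive_peers(self, now):
        limit = HEARTBEAT_INTERVAL * MISSED_ALLOWED
        return {p for p, t in self.last_seen.items() if now - t <= limit}

view = GossipView()
view.on_heartbeat("worker1@host", now=0.0)
view.on_heartbeat("worker2@host", now=0.0)
view.on_heartbeat("worker1@host", now=8.0)  # worker2 stops beating
print(view.alive_peers(now=9.0))  # only worker1 is still considered alive
```

With --without-gossip a worker never builds this view, so it cannot tell whether its peers are online; task consumption itself is unaffected.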

Trevor Boyd Smith
ofirule
  • Could you elaborate on the application consequences of disabling heartbeats? For example, if disabled, will workers ever detect if the broker becomes unavailable? If so, how will they detect it? CloudAMQP (https://www.cloudamqp.com/docs/celery.html) suggests heartbeats are not necessary because "We've enabled low TCP keep-alive intervals on all our RabbitMQ servers so that stale connections will be detected on the TCP level instead of in the application layer." Is that something unique to CloudAMQP or is this a basic reality of any AMQP connection to a RabbitMQ broker? Thanks! – user1847 Apr 06 '21 at 13:02
  • RE "is [low TCP keep-alive intervals on all our RabbitMQ servers] unique to CloudAMQP or [does this apply] to any AMQP connection to a RabbitMQ broker?": that depends on how you configured celery. if you use the default c librabbitmq i could not find what socket config is used. if you use the python amqp library the default socket config [is `TCP_KEEPALIVE=1` (bool), `TCP_KEEPIDLE=60` (secs), `TCP_KEEPINTVL=10` (secs), `TCP_KEEPCNT=9` (count), `TCP_USER_TIMEOUT=1000` (ms).](https://github.com/celery/py-amqp/blob/59af2445549e4e1bb87a932cbb4d196629f16dd6/Changelog#L593) – Trevor Boyd Smith Jan 31 '22 at 21:08
  • for the default [c librabbitmq the code sets tcp keepalive true... and nothing more](https://github.com/alanxz/rabbitmq-c/blob/cc7e1578856b1bc6bbb10e67242bfefa473a8ed7/librabbitmq/amqp_socket.c#L322). so that means it uses the system defaults for the other three parameters. for example on centos7 the default parameters are (from `sysctl -a|grep -i tcp|grep -i keep`): `TCP_KEEPIDLE=7200`, `TCP_KEEPINTVL=75`, `TCP_KEEPCNT=9`. if your system uses a value of 7200 seconds then i recommend keeping your celery worker heartbeat enabled (or your connection can die during a long period of idle). – Trevor Boyd Smith Jan 31 '22 at 21:21
  • the `cloudamqp` people are saying they set the tcp keep idle interval to a lower level (probably like 30 to 60 seconds) and so that may force all the clients to have respond at the tcp level and would force all celery users to have the same tcp keep interval (even if the client has 7200 seconds). but unfortunately i'm not an expert on tcp nor sockets and so i can not say with certainty. – Trevor Boyd Smith Jan 31 '22 at 21:27
  • AWS Amazon MQ also recommends running Celery with --without-hearbeat, but it's not clear from the docs that they really use a low TCP keepalive: https://docs.aws.amazon.com/amazon-mq/latest/developer-guide/best-practices-rabbitmq.html – fjsj Aug 17 '23 at 00:20
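The TCP keep-alive settings discussed in these comments are ordinary socket options. A sketch of how a client library such as py-amqp might apply them (the parameter values mirror the py-amqp defaults quoted above; the fine-grained knobs are Linux-specific, hence the hasattr guards):

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=9):
    """Enable TCP keepalive on a socket (illustrative helper, not py-amqp's API).

    idle:     seconds of inactivity before the first keepalive probe
    interval: seconds between probes
    count:    failed probes before the connection is declared dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-connection tuning knobs below are platform-specific (Linux names).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero
sock.close()
```

With a 60-second idle timeout like this, a dead connection is noticed at the TCP level within a couple of minutes; with the 7200-second system default mentioned above, an idle connection can sit dead for hours, which is why application-layer heartbeats matter there.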

Celery workers started up with the --without-mingle option, as @ofirule mentioned above, will not receive synchronization data from other workers, particularly revoked tasks. So if you revoke a task, all workers currently running will receive that broadcast and store it in memory so that when one of them eventually picks up the task from the queue, it will not execute it:

https://docs.celeryproject.org/en/stable/userguide/workers.html#persistent-revokes

But if a new worker starts up before that task has been dequeued by a worker that received the broadcast, it doesn't know to revoke the task. If it eventually picks up the task, the task is executed. You will see this behavior if you're running in an environment where you are constantly scaling Celery workers in and out dynamically.
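The race above can be modeled in a few lines of plain Python (a hypothetical sketch, not Celery internals): a worker started with mingle copies the revoked ids from its running peers, while one started --without-mingle begins with an empty revoked set and executes the task anyway.

```python
revoke_broadcast = {"task-123"}  # revoke sent while the old workers were running

class Worker:
    def __init__(self, without_mingle, running_peers=()):
        self.revoked = set()
        if not without_mingle:
            # mingle: copy revoked ids from peers that saw the broadcast
            for peer in running_peers:
                self.revoked |= peer.revoked

    def receive_broadcast(self, task_ids):
        self.revoked |= task_ids

    def execute(self, task_id):
        return "skipped" if task_id in self.revoked else "executed"

old = Worker(without_mingle=True)
old.receive_broadcast(revoke_broadcast)  # old worker saw the revoke

late_with = Worker(without_mingle=False, running_peers=[old])
late_without = Worker(without_mingle=True, running_peers=[old])
print(late_with.execute("task-123"), late_without.execute("task-123"))
# skipped executed
```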


I wanted to know whether the --without-heartbeat flag would impact the worker's ability to detect a broker disconnect and to attempt reconnection. The documentation referenced above only opaquely refers to these heartbeats acting at the application layer rather than the TCP/IP layer. What I really want to know is: does eliminating these messages affect my worker's ability to function, specifically its ability to detect a broker disconnect and then reconnect appropriately?

I ran a few quick tests myself and found that with the --without-heartbeat flag passed, workers still detect a broker disconnect very quickly (initiated by me shutting down the RabbitMQ instance), and they attempt to reconnect to the broker and do so successfully when I restart the RabbitMQ instance. So my basic testing suggests the heartbeats are not necessary for basic health checks and functionality. What's the point of them anyway? It's unclear to me, but they don't appear to have an impact on worker functionality.

user1847
  • I think you should post a new question addressing this issue. The broker will reconnect without the heartbeat event, but the heartbeat event is not just checking the connection. It checks that events are sent and received which is a much greater indicator that the app is running as expected. And you may have some use-cases where you want to use that. Like when you have multiple brokers and you want the worker to move to a new broker when events start to drop. – ofirule Apr 06 '21 at 16:16
  • Done, @ofirule! https://stackoverflow.com/questions/66978028/application-impacts-of-celery-workers-running-with-the-without-heartbeat-fla – user1847 Apr 07 '21 at 00:11

Adding to the above answers, setting --without-heartbeat will show your worker as "Offline" in the Flower dashboard, if you are using it.

tenticon