
The problem

I have a RabbitMQ server that serves as a queue hub for one of my systems. In the last week or so, its producers have come to a complete halt every few hours.

What have I tried

Brute force

  • Stopping the consumers releases the lock for a few minutes, but then the blocking returns.
  • Restarting RabbitMQ solves the problem for a few hours.
  • I have an automatic script that performs these ugly restarts, but it's obviously far from a proper solution.

Allocating more memory

Following cantSleepNow's answer, I have increased RabbitMQ's memory high watermark to 90%. The server has a whopping 16GB of memory and the message count is not very high (millions per day), so that does not seem to be the problem.

From the command line:

sudo rabbitmqctl set_vm_memory_high_watermark 0.9

And with /etc/rabbitmq/rabbitmq.config:

[
   {rabbit,
   [
     {loopback_users, []},
     {vm_memory_high_watermark, 0.9}
   ]
   }
].
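For a sense of scale, the 0.9 watermark on this 16 GB host corresponds to an absolute threshold; a quick sketch of the arithmetic:

```python
# 0.9 watermark on a 16 GiB host: the absolute byte count at which
# RabbitMQ raises the memory alarm and starts blocking publishers.
total_bytes = 16 * 1024 ** 3           # 16 GiB of RAM
threshold = int(total_bytes * 0.9)     # watermark threshold in bytes
print(threshold)                       # 15461882265 (~14.4 GiB)
```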

Code & Design

I use Python for all consumers and producers.

Producers

The producers are API servers that handle calls. Whenever a call arrives, a connection is opened, a message is sent, and the connection is closed.

from kombu import Connection

def send_message_to_queue(host, port, queue_name, message):
    """Sends a single message to the queue."""
    with Connection('amqp://guest:guest@%s:%s//' % (host, port)) as conn:
        simple_queue = conn.SimpleQueue(name=queue_name, no_ack=True)
        simple_queue.put(message)
        simple_queue.close()
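One way to harden this path is a hypothetical retry helper (not part of kombu): retry the send with exponential backoff so a transient broker hiccup doesn't fail the API call outright. Note this only catches connection errors; a silently blocked connection would additionally need a timeout.

```python
import time

def send_with_retry(send, message, retries=3, base_delay=0.01):
    """Retry `send(message)` with exponential backoff.

    `send` is any callable that raises OSError on failure -- for
    instance a lambda wrapping send_message_to_queue above.
    """
    for attempt in range(retries):
        try:
            return send(message)
        except OSError:
            if attempt == retries - 1:
                raise  # retries exhausted: surface the error to the API caller
            time.sleep(base_delay * 2 ** attempt)
```

Usage would look like `send_with_retry(lambda m: send_message_to_queue(host, port, 'events', m), message)`, where `'events'` is an illustrative queue name.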

Consumers

The consumers differ slightly from each other, but generally use the following pattern: opening a connection and waiting on it until a message arrives. The connection can stay open for long periods of time (say, days).

with Connection('amqp://whatever:whatever@whatever:whatever//') as conn:
    queue = conn.SimpleQueue(queue_name)  # create the queue handle once, outside the loop
    while True:
        message = queue.get(block=True)
        message.ack()
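A broker-free sketch of a more defensive loop, with a stdlib `queue.Queue` standing in for kombu's `SimpleQueue` (whose `get` with a `timeout` raises `queue.Empty`): getting with a timeout gives the loop a periodic chance to check connection health instead of blocking forever.

```python
import queue

def drain(q, handle, timeout=0.05):
    """Pull messages until the queue stays empty for `timeout` seconds.

    Stand-in for a SimpleQueue consumer loop: on each timeout the loop
    regains control and could run a heartbeat or reconnect check.
    """
    while True:
        try:
            message = q.get(block=True, timeout=timeout)
        except queue.Empty:
            return  # in a real consumer: check the connection, then continue
        handle(message)
```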

Design reasoning

  • Consumers always need to keep an open connection with the queue server
  • The Producer session should only live during the lifespan of the API call

This design caused no problems until about a week ago.

Web view dashboard

The web console shows that the consumers at 127.0.0.1 and 172.31.38.50 block the consumers at 172.31.38.50, 172.31.39.120, 172.31.41.38 and 172.31.41.38.

Blocking / Blocked queues

System metrics

Just to be on the safe side, I checked the server load. As expected, the load average and CPU utilization metrics are low.


Why does RabbitMQ reach such a deadlock?

Adam Matan
  • I didn't understand this part: `the consumers reuse the same connection while waiting on new messages`. How does the producer get the connection back? Can the connection be used only by a producer or (xor, really) a consumer at a single moment? – cantSleepNow Jun 05 '16 at 08:03
  • @cantSleepNow Thanks! Clarifying here and in the question: 1. Each consumer opens a connection, then gets messages from it. For the entire lifespan of the consumer - which can be days - it uses the same connection. 2. Each producer waits for an API call. When a new call arrives, it opens a connection, writes data to it, and closes it immediately after. – Adam Matan Jun 05 '16 at 08:09
  • ok so there is no connection sharing and each consumer and producer are separate processes? Can you somehow determine what happens when the producer closes the connection? Also, are you using blocking connection or select (see this question for reference http://stackoverflow.com/questions/11987838/which-form-of-connection-to-use-with-pika) EDIT sorry for some reason I can't do @ reply tag... – cantSleepNow Jun 05 '16 at 08:28
  • @cantSleepNow 1. Each consumer and producer is a separate process (or uWSGI thread, for that matter) - they share no resources between them. 2. During the deadlock, the producer can't close the connection; it gets stuck at the `simple_queue.put()` call. 3. I'm using blocking connections for the producers - the queue calls return very quickly. 4. No need to @ the OP; I get notified of your comments. – Adam Matan Jun 05 '16 at 08:54
  • @AdamMatan Can you post the logs? When you have connections blocked / blocking, it is most likely due to some RabbitMQ alarm. Which version are you using? – Gabriele Santomaggio Jun 17 '16 at 07:25
  • I ran across this blog this morning: http://blog.domanski.me/rapid-rabbitmq. The author had a similar situation. His conclusion was that the management plugin enabled some code inside of RabbitMQ that caused the flow control to behave poorly, and everything in the queue ended up blocked. In his case, disabling the management plugin fixed it. – Brad Campbell Jun 17 '16 at 16:37
  • @Gabriele I'm using 'RabbitMQ 3.6.2, Erlang R16B03'. The logs indicated memory problems. I don't have the logs from the previous failure (they rotate quite quickly), but when it fails again I will quote them in the question. – Adam Matan Jun 19 '16 at 10:03
  • @BradCampbell How weird. Could it be that the web plugin causes all the trouble? Will try the solution next time the server fails. – Adam Matan Jun 19 '16 at 10:03

2 Answers


This is most likely caused by a memory leak in the management module for RabbitMQ 3.6.2. This has now been fixed in RabbitMQ 3.6.3, and is available here.

The issue itself is described here, but it is also discussed extensively on the RabbitMQ message boards; for example here and here. It has also been known to cause a lot of weird issues; a good example is the issue reported here.

As a temporary fix until the new version is released, you can either upgrade to the newest build, downgrade to 3.6.1, or completely disable the management module.
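If you need the workaround immediately, the management plugin can be disabled from the command line (and re-enabled after upgrading):

```shell
# Disable the management plugin (workaround for the 3.6.2 leak)
sudo rabbitmq-plugins disable rabbitmq_management

# After upgrading to a fixed version, turn it back on
sudo rabbitmq-plugins enable rabbitmq_management
```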

eandersson
    I have just downgraded to 3.6.1 (3.6.3 has a dependency issue on Ubuntu 14.04 LTS), can't wait to see if it works. – Adam Matan Jun 19 '16 at 10:56
  • Let me know if you have any further questions. – eandersson Jun 19 '16 at 17:20
  • Sure. Your home address. My wife will send you flowers, my boss will send you pizza and beer, and I will send you a bottle of fine liquor. I had two nights of uninterrupted sleep (well, except for the kids). Good going. – Adam Matan Jun 21 '16 at 08:59
  • And by the way, I can't believe [RabbitMQ still features 3.6.2](https://www.rabbitmq.com/download.html) on their site. Everybody uses the web plugin, and it's seriously poisonous. – Adam Matan Jun 21 '16 at 09:01
  • @AdamMatan: This still working after the downgrade? 3.6.3 is officially released by the way! – eandersson Jul 22 '16 at 09:18
  • Worked like a charm. After the bad experience with 3.6.2, I will probably refrain from upgrading unless I have a very good reason. – Adam Matan Jul 22 '16 at 10:31

I'm writing this as an answer, partially because it may help and partially because it's too large to be a comment.

First, I'm sorry for missing this: message = queue.get(block=True). Also, a disclaimer - I'm familiar with neither Python nor the Pika API.

AMQP's basic.get is actually synchronous, and you are setting block=True. As I said, I don't know what this means in Pika, but in combination with constantly polling the queue, it doesn't sound efficient. So it could be that, for whatever reason, the publisher gets denied a connection because queue access is blocked by the consumer. It actually fits perfectly with how you temporarily resolve the issue: stopping the consumers releases the lock for a few minutes, but then the blocking returns.

I'd recommend trying AMQP's basic.consume instead of basic.get. I don't know what the motivation for get is, but in most cases (in my experience, anyway) you should go with consume. Just to quote from the aforementioned link:

This method provides a direct access to the messages in a queue using a synchronous dialogue that is designed for specific types of application where synchronous functionality is more important than performance.

The RabbitMQ docs say the connection gets blocked when the broker is low on resources, but as you wrote, the load is quite low. Just to be safe, you may want to check memory consumption and free disk space.
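To make the get-vs-consume distinction concrete without a broker, here is a plain-Python stand-in (`MiniBroker` is illustrative only, not a RabbitMQ, kombu, or Pika API): basic.get polls and may come back empty-handed, while basic.consume registers a callback that deliveries are pushed to.

```python
import queue

class MiniBroker:
    """Toy model of the two AMQP delivery styles."""

    def __init__(self):
        self.q = queue.Queue()
        self.callback = None

    def publish(self, body):
        if self.callback:
            self.callback(body)   # consume-style: push straight to the subscriber
        else:
            self.q.put(body)      # get-style: buffer until someone asks

    def basic_get(self, timeout=0.1):
        """Pull one message, or None if the queue stays empty (the poll may miss)."""
        try:
            return self.q.get(timeout=timeout)
        except queue.Empty:
            return None

    def basic_consume(self, callback):
        """Register a callback; deliver anything already buffered, then push."""
        self.callback = callback
        while not self.q.empty():
            callback(self.q.get())
```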

cantSleepNow
  • Thanks a lot for the answer! `consume()` seems to be partially supported, while `get()` is fully supported; I don't see any `consume()` method in `SimpleQueue`. However, it makes sense for a consumer to block on a queue when there are no messages in it; the blocking applies to the reading end, not the writing end - unless I'm missing something here. – Adam Matan Jun 05 '16 at 12:07
  • @AdamMatan You are welcome. I don't know what you need from these methods, but the tutorial uses consume, so I'm assuming it works: https://www.rabbitmq.com/tutorials/tutorial-one-python.html Anyhow, please let me know (I'm sure others are interested as well) how it all turns out. – cantSleepNow Jun 05 '16 at 12:12
  • I've increased the memory (updated the "What have I tried" section). The server keeps failing on a daily basis. – Adam Matan Jun 19 '16 at 10:01