
I'm running RabbitMQ on Kubernetes. This is my StatefulSet YAML (with its Services):

apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-management
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 15672
    name: http
  selector:
    app: rabbitmq
  type: NodePort
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 5672
    name: amqp
  - port: 4369
    name: epmd
  - port: 25672
    name: rabbitmq-dist
  - port: 61613
    name: stomp
  clusterIP: None
  selector:
    app: rabbitmq
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  serviceName: "rabbitmq"
  replicas: 3
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
      - name: rabbitmq
        image: rabbitmq:management-alpine
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - >
                rabbitmq-plugins enable rabbitmq_stomp;
                if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
                  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
                  cat /etc/resolv.conf.new > /etc/resolv.conf;
                  rm /etc/resolv.conf.new;
                fi;
                until rabbitmqctl node_health_check; do sleep 1; done;
                if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
                  rabbitmqctl stop_app;
                  rabbitmqctl join_cluster rabbit@rabbitmq-0;
                  rabbitmqctl start_app;
                fi;
                rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
        env:
        - name: RABBITMQ_ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              name: rabbitmq-config
              key: erlang-cookie
        ports:
        - containerPort: 5672
          name: amqp
        - containerPort: 61613
          name: stomp
        volumeMounts:
        - name: rabbitmq
          mountPath: /var/lib/rabbitmq
  volumeClaimTemplates:
  - metadata:
      name: rabbitmq
      annotations:
        volume.alpha.kubernetes.io/storage-class: do-block-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

I created the cookie with this command:

kubectl create secret generic rabbitmq-config --from-literal=erlang-cookie=c-is-for-cookie-thats-good-enough-for-me
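
As a quick sanity check that every pod will receive the same cookie, the secret can be decoded back out:

kubectl get secret rabbitmq-config -o jsonpath='{.data.erlang-cookie}' | base64 -d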

All of my Kubernetes cluster nodes are Ready:

kubectl get nodes
NAME                 STATUS   ROLES    AGE   VERSION
kubernetes-master    Ready    master   14d   v1.17.3
kubernetes-slave-1   Ready    <none>   14d   v1.17.3
kubernetes-slave-2   Ready    <none>   14d   v1.17.3

But after restarting the cluster, RabbitMQ didn't start. I tried scaling the StatefulSet down and up, but the problem persists. The output of kubectl describe pod rabbitmq-0:

kubectl describe pod rabbitmq-0
Name:         rabbitmq-0
Namespace:    default
Priority:     0
Node:         kubernetes-slave-1/192.168.0.179
Start Time:   Tue, 24 Mar 2020 22:31:04 +0000
Labels:       app=rabbitmq
              controller-revision-hash=rabbitmq-6748869f4b
              statefulset.kubernetes.io/pod-name=rabbitmq-0
Annotations:  <none>
Status:       Running
IP:           10.244.1.163
IPs:
  IP:           10.244.1.163
Controlled By:  StatefulSet/rabbitmq
Containers:
  rabbitmq:
    Container ID:  docker://d5108f818525030b4fdb548eb40f0dc000dd2cec473ebf8cead315116e3efbd3
    Image:         rabbitmq:management-alpine
    Image ID:      docker-pullable://rabbitmq@sha256:6f7c8d01d55147713379f5ca26e3f20eca63eb3618c263b12440b31c697ee5a5
    Ports:         5672/TCP, 61613/TCP
    Host Ports:    0/TCP, 0/TCP
    State:         Waiting
      Reason:      PostStartHookError: command '/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
  cat /etc/resolv.conf.new > /etc/resolv.conf;
  rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
  rabbitmqctl stop_app;
  rabbitmqctl join_cluster rabbit@rabbitmq-0;
  rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit@rabbitmq-0'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@rabbitmq-0
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools


DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-0']

rabbit@rabbitmq-0:
  * connected to epmd (port 4369) on rabbitmq-0
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on rabbitmq-0
  * suggestion: start the node

Current node details:
 * node name: 'rabbitmqcli-575-rabbit@rabbitmq-0'
 * effective user's home directory: /var/lib/rabbitmq
 * Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==

Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
  node_health_check

Usage

rabbitmqctl [--node <node>] [--longnames] [--quiet] node_health_check [--timeout <timeout>]
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-0']

rabbit@rabbitmq-0:
  * connected to epmd (port 4369) on rabbitmq-0
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on rabbitmq-0
  * suggestion: start the node

Current node details:
 * node name: 'rabbitmqcli-10397-rabbit@rabbitmq-0'
 * effective user's home directory: /var/lib/rabbitmq
 * Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==


    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 24 Mar 2020 22:46:09 +0000
      Finished:     Tue, 24 Mar 2020 22:58:28 +0000
    Ready:          False
    Restart Count:  1
    Environment:
      RABBITMQ_ERLANG_COOKIE:  <set to the key 'erlang-cookie' in secret 'rabbitmq-config'>  Optional: false
    Mounts:
      /var/lib/rabbitmq from rabbitmq (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bbl9c (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  rabbitmq:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rabbitmq-rabbitmq-0
    ReadOnly:   false
  default-token-bbl9c:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bbl9c
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                  From                         Message
  ----     ------                  ----                 ----                         -------
  Normal   Scheduled               31m                  default-scheduler            Successfully assigned default/rabbitmq-0 to kubernetes-slave-1
  Normal   Pulled                  31m                  kubelet, kubernetes-slave-1  Container image "rabbitmq:management-alpine" already present on machine
  Normal   Created                 31m                  kubelet, kubernetes-slave-1  Created container rabbitmq
  Normal   Started                 31m                  kubelet, kubernetes-slave-1  Started container rabbitmq
  Normal   SandboxChanged          16m (x9 over 17m)    kubelet, kubernetes-slave-1  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  3m58s (x2 over 16m)  kubelet, kubernetes-slave-1  Container image "rabbitmq:management-alpine" already present on machine
  Warning  FailedPostStartHook     3m58s                kubelet, kubernetes-slave-1  Exec lifecycle hook ([/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
  cat /etc/resolv.conf.new > /etc/resolv.conf;
  rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
  rabbitmqctl stop_app;
  rabbitmqctl join_cluster rabbit@rabbitmq-0;
  rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_default(2e561153-a830-4d30-ab1e-71c80d10c9e9)" failed - error: command '/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
  cat /etc/resolv.conf.new > /etc/resolv.conf;
  rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
  rabbitmqctl stop_app;
  rabbitmqctl join_cluster rabbit@rabbitmq-0;
  rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit@rabbitmq-0'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@rabbitmq-0
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-0']

rabbit@rabbitmq-0:
  * connected to epmd (port 4369) on rabbitmq-0
  * epmd reports: node 'rabbit' not running at all
                  other nodes on rabbitmq-0: [rabbitmqprelaunch1]
  * suggestion: start the node

Current node details:
 * node name: 'rabbitmqcli-433-rabbit@rabbitmq-0'
 * effective user's home directory: /var/lib/rabbitmq
 * Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==

Error: unable to perform an operation on node 'rabbit@rabbitmq-0'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@rabbitmq-0
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-0']

rabbit@rabbitmq-0:
  * connected to epmd (port 4369) on rabbitmq-0
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on rabbitmq-0
  * suggestion: start the node

Current node details:
 * node name: 'rabbitmqcli-575-rabbit@rabbitmq-0'
 * effective user's home directory: /var/lib/rabbitmq
 * Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==

Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
  node_health_check

, message: "Enabling plugins on node rabbit@rabbitmq-0:\nrabbitmq_stomp\nThe following plugins have been configured:\n  rabbitmq_management\n  rabbitmq_management_agent\n  rabbitmq_stomp\n  rabbitmq_web_dispatch\nApplying plugin configuration to rabbit@rabbitmq-0...\nThe following plugins have been enabled:\n  rabbitmq_stomp\n\nset 4 plugins.\nOffline change; changes will take effect at broker restart.\nTimeout: 70 seconds ...\nChecking health of node rabbit@rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit@rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit@rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit@rabbitmq-0 ...\nTimeout: 70 seconds ...\
Error:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError: unable to perform an operation on node 'rabbit@rabbitmq-0'. Please see diagnostics information and suggestions below.\n\nMost common reasons for this are:\n\n * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)\n * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)\n * Target node is not running\n\nIn addition to the diagnostics info below:\n\n * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more\n * Consult server logs on node rabbit@rabbitmq-0\n * If target node is configured to use long node names, don't forget to use --longnames with CLI tools\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@rabbitmq-0']\n\nrabbit@rabbitmq-0:\n  * connected to epmd (port 4369) on rabbitmq-0\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on rabbitmq-0\n  * suggestion: start the node\n\nCurrent node details:\n * node name: 'rabbitmqcli-10397-rabbit@rabbitmq-0'\n * effective user's home directory: /var/lib/rabbitmq\n * Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\n"
  Normal  Killing  3m58s                kubelet, kubernetes-slave-1  FailedPostStartHook
  Normal  Created  3m57s (x2 over 16m)  kubelet, kubernetes-slave-1  Created container rabbitmq
  Normal  Started  3m57s (x2 over 16m)  kubelet, kubernetes-slave-1  Started container rabbitmq

The output of kubectl get sts:

kubectl get sts
NAME        READY   AGE
consul      3/3     15d
hazelcast   2/3     15d
kafka       2/3     15d
rabbitmq    0/3     13d
zk          3/3     15d

This is the pod log, copied from the Kubernetes dashboard:

2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: list of feature flags found:
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags:   [x] drop_unroutable_metric
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags:   [x] empty_basic_get_metric
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags:   [x] implicit_default_bindings
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags:   [x] quorum_queue
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags:   [x] virtual_host_metadata
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2020-03-24 22:58:43.979 [info] <0.319.0> ra: meta data store initialised. 0 record(s) recovered
2020-03-24 22:58:43.980 [info] <0.324.0> WAL: recovering ["/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0/quorum/rabbit@rabbitmq-0/00000262.wal"]
2020-03-24 22:58:43.982 [info] <0.328.0> 
 Starting RabbitMQ 3.8.2 on Erlang 22.2.8
 Copyright (c) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL 1.1. Website: https://rabbitmq.com

  ##  ##      RabbitMQ 3.8.2
  ##  ##
  ##########  Copyright (c) 2007-2019 Pivotal Software, Inc.
  ######  ##
  ##########  Licensed under the MPL 1.1. Website: https://rabbitmq.com

  Doc guides: https://rabbitmq.com/documentation.html
  Support:    https://rabbitmq.com/contact.html
  Tutorials:  https://rabbitmq.com/getstarted.html
  Monitoring: https://rabbitmq.com/monitoring.html

  Logs: <stdout>

  Config file(s): /etc/rabbitmq/rabbitmq.conf

  Starting broker...2020-03-24 22:58:43.983 [info] <0.328.0> 
 node           : rabbit@rabbitmq-0
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : P1XNOe5pN3Ug2FCRFzH7Xg==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0
2020-03-24 22:58:43.997 [info] <0.328.0> Running boot step pre_boot defined by app rabbit
2020-03-24 22:58:43.997 [info] <0.328.0> Running boot step rabbit_core_metrics defined by app rabbit
2020-03-24 22:58:43.998 [info] <0.328.0> Running boot step rabbit_alarm defined by app rabbit
2020-03-24 22:58:44.002 [info] <0.334.0> Memory high watermark set to 1200 MiB (1258889216 bytes) of 3001 MiB (3147223040 bytes) total
2020-03-24 22:58:44.014 [info] <0.336.0> Enabling free disk space monitoring
2020-03-24 22:58:44.014 [info] <0.336.0> Disk free limit set to 50MB
2020-03-24 22:58:44.018 [info] <0.328.0> Running boot step code_server_cache defined by app rabbit
2020-03-24 22:58:44.018 [info] <0.328.0> Running boot step file_handle_cache defined by app rabbit
2020-03-24 22:58:44.019 [info] <0.339.0> Limiting to approx 1048479 file handles (943629 sockets)
2020-03-24 22:58:44.019 [info] <0.340.0> FHC read buffering:  OFF
2020-03-24 22:58:44.019 [info] <0.340.0> FHC write buffering: ON
2020-03-24 22:58:44.020 [info] <0.328.0> Running boot step worker_pool defined by app rabbit
2020-03-24 22:58:44.021 [info] <0.329.0> Will use 2 processes for default worker pool
2020-03-24 22:58:44.021 [info] <0.329.0> Starting worker pool 'worker_pool' with 2 processes in it
2020-03-24 22:58:44.021 [info] <0.328.0> Running boot step database defined by app rabbit
2020-03-24 22:58:44.041 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-03-24 22:59:14.042 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 22:59:14.042 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2020-03-24 22:59:44.043 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 22:59:44.043 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2020-03-24 23:00:14.044 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:00:14.044 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2020-03-24 23:00:44.045 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:00:44.045 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2020-03-24 23:01:14.046 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:01:14.046 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2020-03-24 23:01:44.047 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:01:44.047 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2020-03-24 23:02:14.048 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:02:14.048 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2020-03-24 23:02:44.049 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:02:44.049 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2020-03-24 23:03:14.050 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:03:14.050 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-03-24 23:03:44.051 [error] <0.328.0> Feature flag `quorum_queue`: migration function crashed: {error,{timeout_waiting_for_tables,[rabbit_durable_queue]}}
[{rabbit_table,wait,3,[{file,"src/rabbit_table.erl"},{line,117}]},{rabbit_core_ff,quorum_queue_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,60}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1486}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-2-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2128}]},{maps,fold_1,3,[{file,"maps.erl"},{line,232}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2126}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,1947}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,631}]}]
2020-03-24 23:03:44.051 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-03-24 23:04:14.052 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:04:14.052 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2020-03-24 23:04:44.053 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:04:44.053 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2020-03-24 23:05:14.055 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:05:14.055 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2020-03-24 23:05:44.056 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:05:44.056 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2020-03-24 23:06:14.057 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:06:14.057 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2020-03-24 23:06:44.058 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:06:44.058 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2020-03-24 23:07:14.059 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:07:14.059 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2020-03-24 23:07:44.060 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:07:44.060 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2020-03-24 23:08:14.061 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:08:14.061 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-03-24 23:08:44.062 [error] <0.327.0> CRASH REPORT Process <0.327.0> with 0 neighbours exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138
2020-03-24 23:08:44.063 [info] <0.43.0> Application rabbit exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_r

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
  • Hi Amir, it would be a good idea to post this on the kubernetes-users Slack channel; have you signed up for that? It would also be useful if you provided a reference to any guide that you are following to set this cluster up. – Rob Kielty Mar 23 '20 at 11:32
  • Can you please post your `kubectl get pods`? When the app says the node is not running, it's probably referring to RabbitMQ nodes (pods); just to check whether they are running before we jump to conclusions. – Will R.O.F. Mar 23 '20 at 14:56
  • @RobKielty I haven't joined that channel; how can I join? I set up my cluster with this guide: https://vitux.com/install-and-deploy-kubernetes-on-ubuntu/ – Amir Soleimani Borujerdi Mar 23 '20 at 21:21
  • @willrof Sorry, I can't add `kubectl get pods` here because of the character limit; I attached the output in my question after the command I entered for creating the Erlang cookie. – Amir Soleimani Borujerdi Mar 23 '20 at 21:30
  • @AmirSoleimani Sure, the comments are not the recommended place for it. Edit your original question and add the output of the command at the end. – Will R.O.F. Mar 23 '20 at 22:58
  • @willrof I added it to the end of my question. – Amir Soleimani Borujerdi Mar 24 '20 at 07:10
  • @amir visit https://slack.kubernetes.io/ – Rob Kielty Mar 24 '20 at 11:04
  • @AmirSoleimani It took me some time to realize that was a "normal" `kubectl get pods`. I'm researching this issue, but since it's with an error `PostStartHookError` it would be valuable to get the output of `kubectl describe pod rabbitmq-0` I'm waiting for your reply. – Will R.O.F. Mar 24 '20 at 17:20
  • @willrof There is a character limit in the question section too, so I had to remove some lines of output that I didn't think were helpful. – Amir Soleimani Borujerdi Mar 24 '20 at 23:39
  • Hi Amir, could you please share how you installed rabbitmq-ha in your cluster (installation source, links)? From what I can see in the logs it seems to be an app-specific problem. There are various implementations of RabbitMQ for Kubernetes, so it would be good to set workload-specific context for your issue for further troubleshooting. – Nepomucen Mar 25 '20 at 11:23
  • @Nepomucen I used this https://wesmorgan.svbtle.com/rabbitmq-cluster-on-kubernetes-with-statefulsets and I added `rabbitmq-plugins enable rabbitmq_stomp;` because I need STOMP for our project. – Amir Soleimani Borujerdi Mar 25 '20 at 13:25
  • Possible duplicate: https://stackoverflow.com/questions/60407082/rabbit-mq-error-while-waiting-for-mnesia-tables – islamhamdi Oct 15 '20 at 11:51
  • Try disabling `rabbitmq-plugins enable rabbitmq_stomp`. It should work! – Gupta Apr 01 '21 at 03:56

2 Answers


Take a look at: https://www.rabbitmq.com/clustering.html#restarting

You should be able to stop the app and then force boot:

rabbitmqctl stop_app
rabbitmqctl force_boot
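
Since the broker here crash-loops before it finishes booting, one way to apply this in the cluster from the question is via kubectl exec (a sketch assuming the pod names and default namespace shown above; force_boot marks the node's data directory so the next start does not wait for its peers, which is why it can be run even while the rabbit app is down):

kubectl exec rabbitmq-0 -- rabbitmqctl force_boot
# restart the pod so the entrypoint boots the node with the marker in place
kubectl delete pod rabbitmq-0

Because /var/lib/rabbitmq is backed by a PersistentVolumeClaim, the marker written by force_boot survives the pod restart.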
– Vincent Gerris

To complement @Vincent Gerris's answer, I strongly recommend using the Bitnami RabbitMQ Docker image.

They have included an environment variable called RABBITMQ_FORCE_BOOT:

https://github.com/bitnami/bitnami-docker-rabbitmq/blob/2c38682053dd9b3e88ab1fb305355d2ce88c2ccb/3.9/debian-10/rootfs/opt/bitnami/scripts/librabbitmq.sh#L760

if is_boolean_yes "$RABBITMQ_FORCE_BOOT" && ! is_dir_empty "${RABBITMQ_DATA_DIR}/${RABBITMQ_NODE_NAME}"; then
    # ref: https://www.rabbitmq.com/rabbitmqctl.8.html#force_boot
    warn "Forcing node to start..."
    debug_execute "${RABBITMQ_BIN_DIR}/rabbitmqctl" force_boot
fi

This will force-boot the node at the container entrypoint.
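
Wired into a StatefulSet like the one in the question, that would look roughly like this (a sketch; the bitnami/rabbitmq image tag and the "yes" value are assumptions, not from the original post):

      containers:
      - name: rabbitmq
        image: bitnami/rabbitmq:3.8        # assumed tag; the stock rabbitmq image ignores this variable
        env:
        - name: RABBITMQ_FORCE_BOOT        # read by the image's entrypoint scripts
          value: "yes"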

– Thomas Decaux