
I am currently working with Apache Pulsar, installed from a Helm chart on a local Minikube cluster. The install goes just fine and Apache Pulsar runs well. However, whenever I shut down or restart my laptop, I can never get all the pods running again; I always end up with the CrashLoopBackOff status. After restarting my machine, I try to restart the Pulsar cluster with minikube start:

xyz-MBP:~ xyz$ minikube start
  minikube v1.23.2 on Darwin 11.4
  Kubernetes 1.22.2 is now available. If you would like to upgrade, specify: --kubernetes-version=v1.22.2
✨  Using the docker driver based on existing profile
  Starting control plane node minikube in cluster minikube
  Pulling base image ...
  Restarting existing docker container for "minikube" ...
  Preparing Kubernetes v1.19.0 on Docker 20.10.8 ...
  Verifying Kubernetes components...
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
    ▪ Using image kubernetesui/dashboard:v2.3.1
    ▪ Using image kubernetesui/metrics-scraper:v1.0.7
  Enabled addons: storage-provisioner, default-storageclass, dashboard

❗  /usr/local/bin/kubectl is version 1.22.0, which may have incompatibilites with Kubernetes 1.19.0.
    ▪ Want kubectl v1.19.0? Try 'minikube kubectl -- get pods -A'
  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default

Now, it looks like it started okay, but when I query the status of the pods some time later, I get the following:

xyz-MBP:pulsar xyz$ kubectl get pods -n pulsar
NAME                                         READY   STATUS             RESTARTS   AGE
pulsar-mini-bookie-0                         0/1     CrashLoopBackOff   8          25h
pulsar-mini-bookie-init-kqx6j                0/1     Completed          0          25h
pulsar-mini-broker-0                         0/1     CrashLoopBackOff   8          25h
pulsar-mini-grafana-555cf54cf-jl5xp          1/1     Running            1          25h
pulsar-mini-prometheus-5556dbb8b8-k5v2v      1/1     Running            1          25h
pulsar-mini-proxy-0                          0/1     Init:1/2           1          25h
pulsar-mini-pulsar-init-h78xk                0/1     Completed          0          25h
pulsar-mini-pulsar-manager-6c6889dff-r6tmk   1/1     Running            1          25h
pulsar-mini-toolset-0                        1/1     Running            1          25h
pulsar-mini-zookeeper-0                      1/1     Running            1          25h

The mini-proxy never gets out of the init stage, and the bookie and broker keep retrying and instantly going back into CrashLoopBackOff. When digging into the logs for the bookie pod, I see the following unfamiliar exception:

01:15:10.164 [main] ERROR org.apache.bookkeeper.bookie.Bookie - Cookie for this bookie is not stored in metadata store. Bookie failing to come up
01:15:10.170 [main] ERROR org.apache.bookkeeper.server.Main - Failed to build bookie server

Additionally, I get an exception from the broker pod:

01:21:44.733 [main-EventThread] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to pulsar-mini-bookie-0.pulsar-mini-bookie.pulsar.svc.cluster.local:3181 as endpopint resolution failed

There is more to the above error, but I didn't want to dump the entire log here. It is the first error that shows up; I believe anything that follows is just fallout from it... let me know if I'm mistaken about that!

Snoop

2 Answers


Solution:

  1. You can check whether the application inside the container is crashing on startup, which is what produces the CrashLoopBackOff. Running the following command should give you a sense of whether that is happening (a concrete example for this cluster follows the list):

kubectl logs "$POD_NAME" --all-containers=true

If you have Stackdriver logging enabled, the following filters can be used to obtain the container logs:

Stackdriver V1:

resource.type="container"
resource.labels.pod_id="$POD_NAME"

Stackdriver V2:

resource.type="k8s_container"
resource.labels.pod_name="$POD_NAME"

  2. You can check whether failing liveness probes are causing the restarts; these show up in the pod events. Running the following command should give you a sense of whether that is happening:

kubectl describe pod "$POD_NAME"

If you have Stackdriver logging, the following filters can be used to get the pod event logs:

Stackdriver V1:

resource.type="gke_cluster"
logName="projects/$PROJECT_ID/logs/events"
jsonPayload.reason="Unhealthy"
jsonPayload.involvedObject.name="$POD_NAME"

Stackdriver V2:

resource.type="k8s_pod"
logName="projects/$PROJECT_ID/logs/events"
jsonPayload.reason="Unhealthy"
resource.labels.pod_name="$POD_NAME"
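
As a concrete example against the cluster from the question (the pod name pulsar-mini-bookie-0 and the pulsar namespace are taken from the kubectl get pods output above; adjust them for your release), these commands pull the crashed container's previous logs and the pod's events:

kubectl logs pulsar-mini-bookie-0 -n pulsar --previous

kubectl describe pod pulsar-mini-bookie-0 -n pulsar

kubectl get events -n pulsar --field-selector involvedObject.name=pulsar-mini-bookie-0

The --previous flag shows the logs of the last terminated container, which is usually where the actual crash reason lives for a pod in CrashLoopBackOff.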

Root cause of this issue: the pod is simply stuck in a loop of starting and crashing.


The cluster needs to be shut down in a particular order, or you can corrupt your data. Try this (a sketch of the commands follows the list):

  1. Shut down brokers first. If you have a proxy, shut that down before the brokers.
  2. Shut down the BookKeeper nodes.
  3. Shut down the ZooKeeper nodes.
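
A minimal sketch of that order with kubectl, assuming the StatefulSet names match the pod names from the question (they may differ in your Helm release):

kubectl scale statefulset pulsar-mini-proxy -n pulsar --replicas=0

kubectl scale statefulset pulsar-mini-broker -n pulsar --replicas=0

kubectl scale statefulset pulsar-mini-bookie -n pulsar --replicas=0

kubectl scale statefulset pulsar-mini-zookeeper -n pulsar --replicas=0

After the machine comes back up, scale them back to their original replica counts in the reverse order (ZooKeeper first, proxy last).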

If this doesn't help, I'd need to see logs to get a better understanding of why the containers aren't starting.

devinbost
  • When you say we need to shut things down, what command is that? Is that a kubectl command to shut down the pods, or something else? – Snoop Oct 23 '21 at 20:09
  • You can stop or destroy the pods. (See https://stackoverflow.com/questions/54821044/how-to-stop-pause-a-pod-in-kubernetes ) If you do this, make sure that you're correctly persisting the data, to ensure it isn't also deleted if you remove the BookKeeper or ZooKeeper pods. Honestly, judging by your message, it looks like you might benefit from working through some video tutorials on the basics of Kubernetes. Without that background knowledge, you're going to have a tough time operating not just Pulsar but any distributed technology on Kubernetes. – devinbost Oct 24 '21 at 04:49