How to check that a restarted Kafka node is completely healthy before moving on to the next in large clusters?

Question

Is there a more in-depth query or endpoint I can hit that tells me a Kafka broker has caught up on all its partitions, has rejoined the cluster, can talk to all listed brokers and ZooKeeper (Zk) nodes, and has no Java exceptions in its logs?

Perhaps there is one key log entry I can look for? Something like [GroupCoordinator ####]: Assignment received from leader for group X? But there are several of these messages as well.

More Details

Currently we use Chef automation for stateful boxes and Kubernetes for our containerized deployments. Both rely on TCP port health checks at startup: once the ports are available, the automation moves on to the next node.
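For reference, the kind of port check described above amounts to roughly the following. This is a minimal sketch, not our actual Chef or Kubernetes probe; the function name `port_is_open` and the timeout are made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # A successful connect only proves the listener socket is up.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

A check like this says nothing about replication state, which is the gap this question is about.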

However, we've seen nodes open their ports well before they are done catching up or moving partitions around. This is a problem because, with the replication factor set to 3, restarting 3 nodes in a row can lose partitions if data is being received during that time. There are also partial failures, such as only 1 of the 3 replicas being available, so that node starts replicating onto other nodes while the others come back up (the timing seems completely random, as some partitions are used far more than others). A multitude of other conditions can also keep a node from being healthy, such as Java exceptions from failing to talk to Zk, expired SSL certs, fetcher issues, etc.

For example, today I monitor the logs until all 50+ ReplicaFetcherThreads have shut down before moving on to the next node. In one PoP that takes about 2-5 minutes. However, in another PoP it can take up to 20 minutes!
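One possible alternative to watching fetcher-thread log lines is to check ISR membership directly: a restarted broker has caught up once it is back in the ISR of every partition it replicates. The sketch below parses text in the format printed by `kafka-topics.sh --describe` (on 0.10.x that command is run with `--zookeeper`). The function name `partitions_missing_broker` and the regex are illustrative assumptions, and the exact spacing of the output may differ between versions.

```python
import re

# Matches per-partition lines such as:
#   Topic: events  Partition: 1  Leader: 2  Replicas: 2,3,4  Isr: 2,4
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+|none)\s+"
    r"Replicas:\s*(?P<replicas>[\d,]*)\s+Isr:\s*(?P<isr>[\d,]*)"
)


def partitions_missing_broker(describe_output: str, broker_id: int):
    """Return (topic, partition) pairs where broker_id is a replica but not in the ISR."""
    missing = []
    for line in describe_output.splitlines():
        m = PARTITION_LINE.search(line)
        if not m:
            continue  # topic summary lines and blanks are skipped
        replicas = {int(x) for x in m["replicas"].split(",") if x}
        isr = {int(x) for x in m["isr"].split(",") if x}
        if broker_id in replicas and broker_id not in isr:
            missing.append((m["topic"], int(m["partition"])))
    return missing
```

An empty result for the restarted broker's id would be the signal to move on to the next node; a non-empty one lists the partitions still catching up.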

Configuration

We run various configurations, but most have these types of settings (with various tuning):

controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
default.replication.factor=3
group.max.session.timeout.ms=60000
group.min.session.timeout.ms=10000
num.partitions=5
num.replica.fetchers=4
session.timeout.ms=30000

Kafka 0.10.2.1 for now, as it's difficult to upgrade given the number of servers and subscribers we have (200+ across half a dozen PoPs). However, if someone can show that newer versions have some kind of health or status endpoint, or another way to query whether a server is healthy and all caught up on its partitions, we'll make the effort to upgrade to that version.

External Tools?

We also run Kafka Manager and Burrow in most PoPs. Perhaps one of them has an API I can query for the complete health state of a particular node?

Bonus: Monitor Under-Replication of Topics

Perhaps these tools can also check for under-replicated topics/partitions? If the replica count falls below a threshold, the rolling restart would pause until the count comes back up before continuing.
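A rough sketch of such a pause-until-replicated gate follows. It assumes a Jolokia agent is attached to each broker so that JMX is reachable over HTTP; the base URL (for example `http://broker1:8778/jolokia`), the function names, and the polling parameters are all assumptions for illustration. The MBean is Kafka's `UnderReplicatedPartitions` gauge. Each broker reports only the partitions it leads, so a cluster-wide gate would need to read every live broker.

```python
import json
import time
import urllib.request

MBEAN = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"


def read_under_replicated(jolokia_base: str, timeout: float = 5.0) -> int:
    """Read one broker's UnderReplicatedPartitions gauge through a Jolokia agent."""
    url = f"{jolokia_base.rstrip('/')}/read/{MBEAN}/Value"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return int(payload["value"])


def wait_until_replicated(read_fn, stable_polls=3, interval=10.0,
                          max_wait=1800.0, sleep=time.sleep):
    """Return True once read_fn() reports 0 for stable_polls consecutive polls.

    Returns False if max_wait seconds of sleeping elapse first.
    """
    waited = 0.0
    streak = 0
    while True:
        try:
            streak = streak + 1 if read_fn() == 0 else 0
        except (OSError, ValueError, KeyError):
            streak = 0  # broker not reachable yet, or unexpected response
        if streak >= stable_polls:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

Usage might look like `wait_until_replicated(lambda: read_under_replicated("http://broker1:8778/jolokia"))`. Requiring several consecutive zero readings is a deliberate choice here, since the gauge can briefly read 0 before partition movement starts.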

You'd need a combination of log gathering and JMX monitoring. The most important things are restart the controller broker last and make sure all replicas are in sync before continuing. Yelp has a Python script on Github that polls Jolokia JMX values and does a rolling restart — OneCricketeer, Sep 03 '18 at 21:23
Yep, I already query for the current controller and make that one last, though with Kubernetes you don't really have a choice with StatefulSets (it goes in order of 5, 4, 3, 2, 1 and 1, 2, 3, 4, 5, etc). What exactly would I be monitoring for in the logs, in an automated fashion? And thanks for the tip about the Yelp script that queries JMX, I'll look into that as well. — eduncan911, Sep 04 '18 at 04:37
We do rolling restarts occasionally, but just looking for no warnings or errors. The ISR numbers are really the main thing we look at — OneCricketeer, Sep 04 '18 at 11:11

score 0 · Answer 1 · answered Nov 19 '22 at 17:04

I'm following up on my original question after a few years.

The short answer is, there isn't a way with older versions of Kafka.

Kafka requires manual monitoring of the logs to confirm that the node is up, healthy, and in sync before moving on to the next node. The logs are large, and many different log entries together indicate that the node is ready.

Maybe newer versions of Kafka (2.0+?) have resolved this issue.
