Is there a more in-depth query or endpoint I can hit to have Kafka tell me that a broker has caught up on all its partitions, has rejoined the cluster, can talk to all listed brokers and ZooKeeper nodes, and has no Java exceptions in its logs?
Perhaps there is one key log entry I can look for, something like "[GroupCoordinator ####]: Assignment received from leader for group X"? But there are several of these messages as well.
More Details
Currently we use Chef automation for stateful boxes and Kubernetes for our containerized deployments. Both rely on TCP port health checks at startup: once a node's ports are available, the rollout moves on to the next node.
However, we've seen nodes open their ports well before they are done catching up or moving partitions around. This is a problem because with a replication factor of 3, if 3 nodes in a row are restarted, the partitions they host could be lost if data is being received during that time. There are also partial failures, such as only 1 of the 3 replicas being available, so that node starts replicating onto other nodes while the others come back up (the timing of this seems completely random, as some partitions are used far more than others). And there is a multitude of other conditions that can keep a node from being healthy, such as Java exceptions from failing to talk to ZooKeeper, expired SSL certs, fetcher issues, etc.
For example, today I monitor the logs and wait until all 50+ ReplicaFetcherThreads have shut down before moving on to the next node. In one PoP that takes about 2-5 minutes. However, in another PoP it can take up to 20 minutes!
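To make the log-watching approach concrete, here is a rough Python sketch of what I do by hand today. The regular expression is an assumption about how the broker logs a fetcher thread's shutdown on our version (a line containing the thread name in brackets followed by "Shutdown completed") and would need adjusting if your log format differs. The function names are hypothetical.

```python
import re
from typing import Iterable, Pattern, Set

# ASSUMPTION: the broker logs one line per fetcher thread containing
# "[ReplicaFetcherThread-<fetcherId>-<brokerId>]" and "Shutdown completed".
# Adjust the pattern to match the log format you actually see.
SHUTDOWN_PATTERN = re.compile(
    r"\[(ReplicaFetcherThread-\d+-\d+)\].*Shutdown completed"
)


def fetchers_shut_down(
    log_lines: Iterable[str], pattern: Pattern[str] = SHUTDOWN_PATTERN
) -> Set[str]:
    """Return the distinct fetcher thread names that logged a shutdown."""
    names = set()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            names.add(match.group(1))
    return names


def all_fetchers_done(log_lines: Iterable[str], expected_count: int) -> bool:
    """True once at least expected_count distinct fetchers have shut down."""
    return len(fetchers_shut_down(log_lines)) >= expected_count
```

The obvious weakness is that expected_count has to be known in advance and the check depends entirely on log wording, which is why I'm hoping for something better.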
Configuration
We run various configurations, but most have these types of settings (with various tuning):
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
default.replication.factor=3
group.max.session.timeout.ms=60000
group.min.session.timeout.ms=10000
num.partitions=5
num.replica.fetchers=4
session.timeout.ms=30000
We are on Kafka 0.10.2.1 for now, as it's difficult to upgrade with the number of servers and subscribers we have (200+ across half a dozen PoPs). However, if someone can show that newer versions have some type of health or status endpoint, or a way to query a server's health and confirm it has caught up on all partitions, we'll make the effort to upgrade to that version.
External Tools?
We also run Kafka Manager as well as Burrow in most PoPs. Perhaps those have an API I can query for the complete health state of a particular node?
Bonus: Monitor Under-Replication of Topics
Perhaps I could also use these tools to check for under-replicated topics/partitions? If the number of in-sync replicas falls below a threshold, the rolling restart would pause until the replica count comes back up before continuing.
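Here is a minimal Python sketch of the kind of gate I have in mind. It assumes the text it parses looks like the per-partition lines printed by kafka-topics.sh --describe (Topic, Partition, Leader, Replicas, Isr). The function names, the fetch_describe callable, and the default thresholds are all hypothetical; how the describe output is obtained is left to the caller.

```python
import re
import time
from typing import Callable, List, Tuple

# ASSUMPTION: per-partition lines look like
#   Topic: events  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*\S+\s+Replicas:\s*(?P<replicas>[\d,]*)\s+Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output: str) -> List[Tuple[str, int]]:
    """Return (topic, partition) pairs whose ISR is smaller than the replica set."""
    result = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if not match:
            continue  # topic summary lines and blank lines are skipped
        replicas = [r for r in match.group("replicas").split(",") if r]
        isr = [r for r in match.group("isr").split(",") if r]
        if len(isr) < len(replicas):
            result.append((match.group("topic"), int(match.group("partition"))))
    return result


def wait_until_replicated(
    fetch_describe: Callable[[], str],
    max_under_replicated: int = 0,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until few enough partitions are under-replicated, or time out.

    Returns True if the cluster recovered within the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if len(under_replicated(fetch_describe())) <= max_under_replicated:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)
```

In our setup fetch_describe would probably wrap a subprocess call to kafka-topics.sh with --describe and --under-replicated-partitions, and the rolling restart would only move on to the next node when wait_until_replicated returns True. This still would not catch the other failure modes above (ZooKeeper connectivity, expired certs), so it covers only part of what I'm asking for.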