Kafka machines are installed as part of hortonworks packages , kafka
version is 0.1X
We run the deeg_data
applications, consuming data from kafka
topics
On last days we saw that our application – deeg_data
are failed and we start to find the root cause
On kafka
cluster we see the following behavior
/usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh --group deeg_data --describe --bootstrap-server kafka1:6667
To enable GC log rotation, use -Xloggc:<filename> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=<num_of_files>
where num_of_file > 0
GC log rotation is turned off
Consumer group ‘deeg_data’ is rebalancing
from kafka
side kafka
cluster is healthy and all topics are balanced and all kafka brokers are up and signed correctly to zookeeper
After some time ( couple hours ) , we run again the following , but without the errors about - Consumer group ‘deeg_data’ is rebalancing
And we get the following correctly results
/usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh --group deeg_data --describe --bootstrap-server kafka1:6667
To enable GC log rotation, use -Xloggc:<filename> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=<num_of_files>
where num_of_file > 0
GC log rotation is turned off
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
deeg_data pot.sdr.proccess 0 6397256247 6403318505 6062258 consumer-1_/10.3.6.237
deeg_data pot.sdr.proccess 1 6397329465 6403390955 6061490 consumer-1_/10.3.6.237
deeg_data pot.sdr.proccess 2 6397314633 6403375153 6060520 consumer-1_/10.3.6.237
deeg_data pot.sdr.proccess 3 6397258695 6403320788 6062093 consumer-1_/10.3.6.237
deeg_data pot.sdr.proccess 4 6397316230 6403378448 6062218 consumer-1_/10.3.6.237
deeg_data pot.sdr.proccess 5 6397325820 6403388053 6062233 consumer-1_/10.3.6.237.
.
.
.
So we want to understand why we get:
Consumer group ‘deeg_data’ is rebalancing
What is the reason for above state , and why we get rebalancing
we also have good post (https://www.confluent.io/blog/kafka-consumer-multi-threaded-messaging/)
Group rebalancing Consumer group rebalancing is triggered when partitions need to be reassigned among consumers in the consumer group: A new consumer joins the group; an existing consumer leaves the group; an existing consumer changes subscription; or partitions are added to one of the subscribed topics.
Rebalancing is orchestrated by the group coordinator and it involves communication with all consumers in the group. To dive deeper into the consumer group rebalance protocol, see Everything You Always Wanted to Know About Kafka’s Rebalance Protocol But Were Afraid to Ask by Matthias J. Sax from Kafka Summit and The Magical Rebalance Protocol of Apache Kafka by Gwen Shapira.
Regarding consumer client code, some of the partitions assigned to it might be revoked during a rebalance. In the older version of the rebalancing protocol, called eager rebalancing, all partitions assigned to a consumer are revoked, even if they are going to be assigned to the same consumer again. With the newer protocol version, incremental cooperative rebalancing, only partitions that are reassigned to another consumer will be revoked. You can learn more about the new rebalancing protocol in this blog post by Konstantine Karantasis and this blog post by Sophie Blee-Goldman.
Regardless of protocol version, when a partition is about to be revoked, the consumer has to make sure that record processing is finished and the offset is committed for that partition before informing the group coordinator that the partition can be safely reassigned.
With automatic offset commit enabled in the thread per consumer model, you don’t have to worry about group rebalancing. Everything is done by the poll method automatically. However, if you disable automatic offset commit and commit manually, it’s your responsibility to commit offsets before the join group request is sent. You can do this in two ways:
Note - also good post is from you-tube - https://www.youtube.com/watch?v=QaeXDh12EhE
Note - good stack-overflow post - Kafka Consumer Rebalancing takes too long
Note - from ENV side , since our zookeeper servers are installed on VM machines and VM machine are using non ssd disks , and regarding to swap consuming , then I think we need to consider also the post - https://community.cloudera.com/t5/Community-Articles/Zookeeper-Sizing-and-Placement/ta-p/247885