
I have an AWS MSK Kafka cluster with 2 brokers. From the logs I can see (on each broker) that they are constantly rebalancing. Every minute the logs show:

Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350887 (__consumer_offsets-21) (reason: Adding new member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 with group instance id None) (kafka.coordinator.group.GroupCoordinator)

And 25 seconds later:

Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350888 (__consumer_offsets-21) (reason: removing member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)

Why does this happen? What is causing it? And what is the amazon.msk.canary.group.broker-1 consumer group?
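
For what it's worth, the flapping can also be watched from outside the broker logs with Kafka's Java AdminClient. A minimal sketch (the bootstrap address below is a placeholder for the cluster's actual bootstrap brokers; the group id is the one from the logs above) that polls the group's state and member count:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ConsumerGroupDescription;

    public class CanaryGroupWatcher {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder address - use your cluster's bootstrap brokers here.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                    "b-1.example.kafka.eu-west-1.amazonaws.com:9092");

            String groupId = "amazon.msk.canary.group.broker-1";
            try (AdminClient admin = AdminClient.create(props)) {
                // Poll the group description; a join/leave loop shows up as
                // the state flipping between Stable, PreparingRebalance and
                // Empty while the member count bounces between 1 and 0.
                for (int i = 0; i < 10; i++) {
                    Map<String, ConsumerGroupDescription> groups =
                            admin.describeConsumerGroups(List.of(groupId)).all().get();
                    ConsumerGroupDescription d = groups.get(groupId);
                    System.out.printf("state=%s, members=%d%n", d.state(), d.members().size());
                    Thread.sleep(5_000);
                }
            }
        }
    }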

amorfis
  • Sorry, I can't help you with why it is rebalancing and dropping, but the canary consumer group is what AWS uses to monitor health and metrics on the Kafka cluster: https://docs.aws.amazon.com/msk/latest/developerguide/troubleshooting.html#amazon_msk_canary Are there no other server logs? Kafka has two logs by default: the message/topic logs and the internal application logs. I believe the default internal application logs are located at $kafka_Home/kafka/logs/; you might have to do some digging around to find the right logs. – Wobbley Oct 12 '21 at 09:18
  • I'm experiencing the same behaviour on my MSK clusters. Did you end up finding the cause? – borgespires Mar 15 '22 at 15:18
  • Did you find any explanation for this? We experience the same symptom in a three-broker m5.large cluster running Kafka 2.8.1. The cluster will run fine for a few days, then occasionally the amazon.msk.canary.group.broker-N groups for all 3 brokers go into a tight rebalance loop. The only fix we have found is a rolling restart of all brokers in the cluster. – Dude0001 Jul 13 '22 at 16:54
  • We have the same issue and the response from AWS was "This is from internal consumer groups managed by MSK. Amazon MSK creates and uses the following internal topics: __amazon_msk_canary and __amazon_msk_canary_state for cluster health and diagnostic metrics. Consumer groups (amazon.msk.canary*) shown in the logs are MSK's internal, therefore you do not need to be worried about them." – Joe M Aug 31 '22 at 01:26
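
As the comments note, the amazon.msk.canary.* groups are MSK-internal, so the practical takeaway is mostly to exclude them from your own tooling. A minimal Java AdminClient sketch (the bootstrap address is again a placeholder) that lists consumer groups while skipping the canary ones:

    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ConsumerGroupListing;

    public class ListNonCanaryGroups {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder address - use your cluster's bootstrap brokers here.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                    "b-1.example.kafka.eu-west-1.amazonaws.com:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                for (ConsumerGroupListing g : admin.listConsumerGroups().all().get()) {
                    // Skip MSK's internal canary groups so their constant
                    // rebalance churn doesn't pollute application dashboards.
                    if (g.groupId().startsWith("amazon.msk.canary")) {
                        continue;
                    }
                    System.out.println(g.groupId());
                }
            }
        }
    }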

1 Answer


Could it be something to do with the configuration of Java's garbage collection on the brokers? I remember reading that a misconfigured garbage collector can pause the broker for a few seconds and make it lose connectivity to ZooKeeper, hence the flapping behavior. Could you check whether you are applying any custom configuration for garbage collection (e.g. via the KAFKA_JVM_PERFORMANCE_OPTS environment variable)?
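
This isn't directly actionable on MSK itself, where the broker JVM is not accessible, but on a self-managed broker (or to rule out GC trouble in your own clients) the standard java.lang.management API can show whether long pauses are happening. A minimal sketch; comparing against the session timeout is only a rule of thumb:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcPauseCheck {
        public static void main(String[] args) {
            // Print cumulative collection counts and times for each collector
            // in this JVM. A collector whose average pause grows toward the
            // broker's zookeeper.session.timeout.ms would explain dropped
            // ZooKeeper sessions and the resulting flapping.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long count = gc.getCollectionCount();
                long timeMs = gc.getCollectionTime();
                double avgMs = count > 0 ? (double) timeMs / count : 0.0;
                System.out.printf("%s: collections=%d, totalMs=%d, avgMs=%.1f%n",
                        gc.getName(), count, timeMs, avgMs);
            }
        }
    }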

Ece Tavasli
  • We have no control over such variables at all. All we can do is set some Kafka configuration for MSK or increase the number of brokers. – amorfis Nov 04 '21 at 14:55