
I have an AWS MSK Kafka cluster with 2 brokers. From the logs I can see (on each broker) that they are constantly rebalancing. Every minute the logs show:

Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350887 (__consumer_offsets-21) (reason: Adding new member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 with group instance id None) (kafka.coordinator.group.GroupCoordinator)

And 25 seconds later:

Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350888 (__consumer_offsets-21) (reason: removing member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)

Why does this happen? What is causing it? And what is the amazon.msk.canary.group.broker-1 consumer group?
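
For what it's worth, the flapping can also be watched from outside the broker logs with Kafka's Java AdminClient. A minimal sketch (the bootstrap address below is a placeholder for the cluster's actual bootstrap brokers; the group id is the one from the logs above) that polls the group's state and member count:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ConsumerGroupDescription;

    public class CanaryGroupWatcher {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder address - use your cluster's bootstrap brokers here.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                    "b-1.example.kafka.eu-west-1.amazonaws.com:9092");

            String groupId = "amazon.msk.canary.group.broker-1";
            try (AdminClient admin = AdminClient.create(props)) {
                // Poll the group description; a join/leave loop shows up as
                // the state flipping between Stable, PreparingRebalance and
                // Empty while the member count bounces between 1 and 0.
                for (int i = 0; i < 10; i++) {
                    Map<String, ConsumerGroupDescription> groups =
                            admin.describeConsumerGroups(List.of(groupId)).all().get();
                    ConsumerGroupDescription d = groups.get(groupId);
                    System.out.printf("state=%s, members=%d%n", d.state(), d.members().size());
                    Thread.sleep(5_000);
                }
            }
        }
    }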

amorfis
  • Sorry, I can't help you with why it is rebalancing and dropping, but the canary consumer group is what AWS uses to monitor health and metrics on the Kafka cluster: https://docs.aws.amazon.com/msk/latest/developerguide/troubleshooting.html#amazon_msk_canary Are there no other server logs? Kafka has two logs by default: the message/topic logs and the internal application logs. I believe the default internal application logs are located at $kafka_Home/kafka/logs/; you might have to do some digging around to find the right logs. – Wobbley Oct 12 '21 at 09:18
  • I'm experiencing the same behaviour on my MSK clusters. Did you end up finding the cause? – borgespires Mar 15 '22 at 15:18
  • Did you find any explanation for this? We experience the same symptom in a three-broker m5.large cluster running Kafka 2.8.1. The cluster will run fine for a few days, then occasionally the amazon.msk.canary.group.broker-N groups for all 3 brokers go into a tight rebalance loop. The only fix we have found is a rolling restart of all brokers in the cluster. – Dude0001 Jul 13 '22 at 16:54
  • We have the same issue and the response from AWS was "This is from internal consumer groups managed by MSK. Amazon MSK creates and uses the following internal topics: __amazon_msk_canary and __amazon_msk_canary_state for cluster health and diagnostic metrics. Consumer groups (amazon.msk.canary*) shown in the logs are MSK's internal, therefore you do not need to be worried about them." – Joe M Aug 31 '22 at 01:26
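
As the comments note, the amazon.msk.canary.* groups are MSK-internal, so the practical takeaway is mostly to exclude them from your own tooling. A minimal Java AdminClient sketch (the bootstrap address is again a placeholder) that lists consumer groups while skipping the canary ones:

    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ConsumerGroupListing;

    public class ListNonCanaryGroups {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder address - use your cluster's bootstrap brokers here.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                    "b-1.example.kafka.eu-west-1.amazonaws.com:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                for (ConsumerGroupListing g : admin.listConsumerGroups().all().get()) {
                    // Skip MSK's internal canary groups so their constant
                    // rebalance churn doesn't pollute application dashboards.
                    if (g.groupId().startsWith("amazon.msk.canary")) {
                        continue;
                    }
                    System.out.println(g.groupId());
                }
            }
        }
    }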

1 Answer


Could it be something to do with the configuration of Java's garbage collection on the brokers? I remember reading that a misconfigured garbage collector can pause the broker for a few seconds and make it lose connectivity to ZooKeeper, hence the flapping behavior. Could you check whether you are applying any custom configuration for garbage collection (e.g. via the KAFKA_JVM_PERFORMANCE_OPTS environment variable)?
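
This isn't directly actionable on MSK itself, where the broker JVM is not accessible, but on a self-managed broker (or to rule out GC trouble in your own clients) the standard java.lang.management API can show whether long pauses are happening. A minimal sketch; comparing against the session timeout is only a rule of thumb:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcPauseCheck {
        public static void main(String[] args) {
            // Print cumulative collection counts and times for each collector
            // in this JVM. A collector whose average pause grows toward the
            // broker's zookeeper.session.timeout.ms would explain dropped
            // ZooKeeper sessions and the resulting flapping.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long count = gc.getCollectionCount();
                long timeMs = gc.getCollectionTime();
                double avgMs = count > 0 ? (double) timeMs / count : 0.0;
                System.out.printf("%s: collections=%d, totalMs=%d, avgMs=%.1f%n",
                        gc.getName(), count, timeMs, avgMs);
            }
        }
    }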

Ece Tavasli
  • We have no control over such variables at all. All we can do is set some Kafka configuration for MSK or increase the number of brokers. – amorfis Nov 04 '21 at 14:55