
We have created a three-node Kafka 3.3.1 cluster in KRaft mode, based on the bitnami/kafka image. The basic configuration for all nodes is shown below (the port numbers differ per node, along with other node-specific changes):

      KAFKA_ENABLE_KRAFT: 'yes'
      KAFKA_KRAFT_CLUSTER_ID: xxyyddjjjddkk1234
      KAFKA_CFG_PROCESS_ROLES: broker,controller
      KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_CFG_LISTENERS: CONTROLLER://:9093,INSIDE://:9092,EXTERNAL://:9094
      KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,INSIDE:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@dpkafka01:9093,2@dpkafka02:9093,3@dpkafka03:9093
      KAFKA_CFG_ADVERTISED_LISTENERS: INSIDE://dpkafka02:9092,EXTERNAL://_{HOSTIP}:9098
      KAFKA_BROKER_ID: 2
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_HEAP_OPTS: "-Xmx1G -Xms256m"
      KAFKA_LOG_DIRS: /bitnami/kafka/kafka-logs
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'false'
      KAFKA_LOG_RETENTION_MS: 7200000
      KAFKA_LOG_SEGMENT_MS: 86400000
      KAFKA_LOG_DELETE_RETENTION_MS: 7200000
      KAFKA_LOG_RETENTION_CHECK_INTERVAL_MS: 60000
      KAFKA_LOG_CLEANUP_POLICY: "compact,delete"
      KAFKA_CFG_GROUP_INITIAL_REBALANCE_DELAY_MS: 12000
      KAFKA_CFG_NUM_RECOVERY_THREADS_PER_DATA_DIR: 4
      KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_CFG_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_CFG_TRANSACTION_STATE_LOG_MIN_ISR: 2
      ALLOW_PLAINTEXT_LISTENER: 'yes'
      BITNAMI_DEBUG: 'true'
      KAFKA_OPTS: -javaagent:/opt/bitnami/kafka/libs/jmx_prometheus_javaagent.jar=7072:/opt/bitnami/kafka/libs/prom-jmx-agent-config.yml

While the cluster works for a while, one or two of the nodes shut down very frequently. The logs are not very helpful for identifying the root cause. Some relevant log lines we see before the state changes to shutdown:

[2022-12-04 08:35:16,928] INFO [RaftManager nodeId=2] Become candidate due to fetch timeout (org.apache.kafka.raft.KafkaRaftClient)
[2022-12-04 08:35:17,414] INFO [RaftManager nodeId=2] Disconnecting from node 3 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:17,414] INFO [RaftManager nodeId=2] Cancelled in-flight FETCH request with correlation id 73082 due to node 3 being disconnected (elapsed time since creation: 2471ms, elapsed time since send: 2471ms, request timeout: 2000ms) (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:27,508] INFO [RaftManager nodeId=2] Completed transition to CandidateState(localId=2, epoch=31047, retries=1, electionTimeoutMs=1697) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:27,508] INFO [Controller 2] In the new epoch 31047, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:27,802] INFO [RaftManager nodeId=2] Completed transition to Unattached(epoch=31048, voters=[1, 2, 3], electionTimeoutMs=0) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:27,802] INFO [Controller 2] In the new epoch 31048, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:27,815] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat] Client requested disconnect from node 3 (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:27,815] INFO [BrokerLifecycleManager id=2] Unable to send a heartbeat because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2022-12-04 08:35:27,830] INFO [RaftManager nodeId=2] Completed transition to Voted(epoch=31048, votedId=1, voters=[1, 2, 3], electionTimeoutMs=1014) (org.apache.kafka.raft.QuorumState)

.....

[2022-12-04 08:35:32,210] INFO [Broker id=2] Stopped fetchers as part of become-follower for 479 partitions (state.change.logger)
[2022-12-04 08:35:32,211] INFO [Broker id=2] Started fetchers as part of become-follower for 479 partitions (state.change.logger)
[2022-12-04 08:35:32,232] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,232] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Client requested connection close from node 1 (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:32,233] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Cancelled in-flight FETCH request with correlation id 675913 due to node 1 being disconnected (elapsed time since creation: 4394ms, elapsed time since send: 4394ms, request timeout: 30000ms) (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:32,233] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=1961820001, epoch=181722) to node 1: (org.apache.kafka.clients.FetchSessionHandler)
java.io.IOException: Client was shutdown before response was read
    at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:108)
    at kafka.server.BrokerBlockingSender.sendRequest(BrokerBlockingSender.scala:113)
    at kafka.server.RemoteLeaderEndPoint.fetch(RemoteLeaderEndPoint.scala:78)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:309)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:124)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:123)
    at scala.Option.foreach(Option.scala:407)
    at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:123)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:106)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2022-12-04 08:35:32,234] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,234] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,237] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,237] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,237] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,245] INFO [GroupCoordinator 2]: Resigned as the group coordinator for partition 13 in epoch Some(3200) (kafka.coordinator.group.GroupCoordinator)

....

[2022-12-04 08:35:48,229] INFO [Controller 2] Unfenced broker: 2 (org.apache.kafka.controller.ClusterControlManager)
[2022-12-04 08:35:48,254] INFO [RaftManager nodeId=2] Completed transition to Unattached(epoch=31055, voters=[1, 2, 3], electionTimeoutMs=1607) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:48,254] INFO [RaftManager nodeId=2] Vote request VoteRequestData(clusterId='<redacted>', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=31055, candidateId=3, lastOffsetEpoch=31052, lastOffset=6552512)])]) with epoch 31055 is rejected (org.apache.kafka.raft.KafkaRaftClient)
[2022-12-04 08:35:48,254] WARN [Controller 2] Renouncing the leadership due to a metadata log event. We were the leader at epoch 31052, but in the new epoch 31055, the leader is (none). Reverting to last committed offset 6552511. (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 8243762 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] alterPartition: failed with NotControllerException in 8005283 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 7743806 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 7243753 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] processBrokerHeartbeat: failed with NotControllerException in 7151815 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] processBrokerHeartbeat: failed with NotControllerException in 7151616 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 6743693 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 6243134 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 5742969 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 5242852 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 4742694 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 4242529 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 3742380 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 3242258 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 2741822 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 2241677 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 1741549 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 1241369 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 741246 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] maybeFenceReplicas: failed with NotControllerException in 244485 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 241049 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] maybeFenceReplicas: failed with NotControllerException in 196629 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] processBrokerHeartbeat: failed with NotControllerException in 27063 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,255] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat] Client requested disconnect from node 2 (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:48,255] ERROR Encountered fatal fault: exception while renouncing leadership (org.apache.kafka.server.fault.ProcessExitingFaultHandler)
java.lang.NullPointerException
    at org.apache.kafka.timeline.SnapshottableHashTable$HashTier.mergeFrom(SnapshottableHashTable.java:125)
    at org.apache.kafka.timeline.Snapshot.mergeFrom(Snapshot.java:68)
    at org.apache.kafka.timeline.SnapshotRegistry.deleteSnapshot(SnapshotRegistry.java:236)
    at org.apache.kafka.timeline.SnapshotRegistry$SnapshotIterator.remove(SnapshotRegistry.java:67)
    at org.apache.kafka.timeline.SnapshotRegistry.revertToSnapshot(SnapshotRegistry.java:214)
    at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1232)
    at org.apache.kafka.controller.QuorumController.access$3300(QuorumController.java:150)
    at org.apache.kafka.controller.QuorumController$QuorumMetaLogListener.lambda$handleLeaderChange$3(QuorumController.java:1076)
    at org.apache.kafka.controller.QuorumController$QuorumMetaLogListener.lambda$appendRaftEvent$4(QuorumController.java:1101)
    at org.apache.kafka.controller.QuorumController$ControlEvent.run(QuorumController.java:496)
    at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
    at java.base/java.lang.Thread.run(Thread.java:829)
[2022-12-04 08:35:48,259] INFO [BrokerServer id=2] Transition from STARTED to SHUTTING_DOWN (kafka.server.BrokerServer)
[2022-12-04 08:35:48,259] INFO [BrokerServer id=2] shutting down (kafka.server.BrokerServer)
[2022-12-04 08:35:48,261] INFO [BrokerLifecycleManager id=2] Beginning controlled shutdown. (kafka.server.BrokerLifecycleManager)
[2022-12-04 08:35:48,277] INFO [RaftManager nodeId=2] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=31055, leaderId=3, voters=[1, 2, 3], highWatermark=Optional[LogOffsetMetadata(offset=6552512, metadata=Optional[(segmentBaseOffset=6497886,relativePositionInSegment=3821894)])], fetchingSnapshot=Optional.empty) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:48,355] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat]: Recorded new controller, from now on will use broker dpkafka03:9093 (id: 3 rack: null) (kafka.server.BrokerToControllerRequestThread)

We would appreciate guidance from anyone experienced with KRaft-mode Kafka clusters on debugging this issue. A second problem is that the container does not exit after the error, which causes the service to fail; the container would be restarted by our orchestration layer if it exited. (This is a separate problem related to the Bitnami images we use.)
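
As a stopgap for the process staying up after the fatal fault, a container healthcheck can let an orchestrator that acts on health status (e.g. Docker Swarm) detect and replace the dead broker. A minimal sketch for the compose file, assuming the Bitnami layout and the INSIDE listener on 9092 (intervals and thresholds are illustrative):

      healthcheck:
        # Fails once the broker can no longer serve API requests,
        # even though the JVM process is still running
        test: ["CMD-SHELL", "/opt/bitnami/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 > /dev/null 2>&1"]
        interval: 30s
        timeout: 10s
        retries: 3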

We also haven't found many production examples that use KRaft mode. Are we missing some configuration, or do we need to change any default values, such as the request timeout, in KRaft mode?
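
For reference, the Raft timing values visible in the logs above (fetchTimeoutMs=2000, request timeout: 2000ms) are the Kafka defaults. If network hiccups are triggering the elections, one experiment is raising the quorum timeouts through Bitnami's KAFKA_CFG_* mapping. A sketch, with illustrative values rather than recommendations:

      KAFKA_CFG_CONTROLLER_QUORUM_FETCH_TIMEOUT_MS: 5000    # default 2000; time without a successful fetch from the leader before a voter becomes a candidate
      KAFKA_CFG_CONTROLLER_QUORUM_REQUEST_TIMEOUT_MS: 5000  # default 2000; matches the "request timeout: 2000ms" seen in the logs
      KAFKA_CFG_CONTROLLER_QUORUM_ELECTION_TIMEOUT_MS: 2000 # default 1000; wait before starting a new election as a candidate

These map to controller.quorum.fetch.timeout.ms, controller.quorum.request.timeout.ms, and controller.quorum.election.timeout.ms respectively.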

donnie
  • So without KRaft you don't see this error? Did you have other issues with KRaft? Does it work without Docker? Can you conclude KRaft is safe to use in production? – sekrett Dec 06 '22 at 15:30
  • Our KRaft-based clusters are running Kafka 3.3.1. We have six clusters, and two of them hit this error occasionally. All are 3-node clusters. I'm planning to increase the node count, based on another user's suggestion, to ensure a minimum of three quorum voters at all times. We also have Kafka 2.8.1 clusters that use ZooKeeper for cluster management; they are also very sensitive to network changes. I was hoping that reading the logs during boot would be faster with 3.x, but I didn't see any significant improvement. These run on EC2 instances in a Docker Swarm environment. – donnie Dec 06 '22 at 16:58
  • We are having the same issue. In our case, the error popped up unexpectedly - everything worked and then everything went kaput. – Aladin Dec 14 '22 at 17:11
  • Update: we have 5 nodes where each node is both a controller and a broker. Now we see that we get this error when we start nodes 1, 2, 5, while starting nodes 2, 3, 4 is fine. – Aladin Dec 15 '22 at 13:16
  • @Aladin Do you mean the quantity of nodes? You have repeated 2; does it work or not? – sekrett Dec 22 '22 at 10:05
  • @sekrett Yes, this is not a mistake. By numbers I mean indices: node 1, node 2, ... node 5. The repetition of node 2 is not a mistake either. What helped us was completely removing the Kafka cluster metadata on nodes 1 and 2. However, since it's not production, we are not sure if it's safe. I'm not sure why it happened, but I assume we had some unsynced nodes, which made the leader decision harder for a specific partition. I do remember that it all started after we created a new topic, which was not created on nodes 1 and 5. – Aladin Dec 22 '22 at 11:47

2 Answers


I consulted my workmate (https://github.com/alexkuoecity) about that NullPointerException and got the following patch from him:

diff --git a/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java b/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java
index 299f65a6f7..e87ce22264 100644
--- a/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java
+++ b/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java
@@ -105,6 +105,7 @@ class SnapshottableHashTable<T extends SnapshottableHashTable.ElementWithStartEp

         HashTier(int size) {
             this.size = size;
+            this.deltaTable = new BaseHashTable<T>(size);
         }

         @SuppressWarnings("unchecked")

I applied this to the 3.3 branch of Kafka, and it seems to work. The patch eagerly initializes deltaTable in the HashTier constructor, so it is no longer null when mergeFrom (the top frame in the stack trace above) dereferences it. I still have no idea about the root cause, though, so use it at your own risk. To build a patched release:

$ git clone https://github.com/apache/kafka.git
$ cd kafka
$ git checkout 3.3
$ patch -p1 < file.patch
$ ./gradlew releaseTarGz

Then copy the whole Kafka directory into your Docker image and run it.
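
A hypothetical Dockerfile sketch of that last step, assuming Gradle drops the tarball in core/build/distributions/ (the exact file name depends on the Scala and Kafka versions):

    FROM bitnami/kafka:3.3.1
    USER root
    # Overlay the patched build onto the bundled Kafka; path and file name are assumptions
    ADD core/build/distributions/kafka_2.13-3.3.1.tgz /tmp/
    RUN cp -rf /tmp/kafka_2.13-3.3.1/* /opt/bitnami/kafka/ && rm -rf /tmp/kafka_2.13-3.3.1
    USER 1001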


This happens to me also. When I deploy, docker compose starts fine, but after I reboot the VM entirely I get a different error:

kafka_1             | [2022-12-06 15:23:04,721] ERROR [Controller 1] writeNoOpRecord: unable to start processing because of TimeoutException. (org.apache.kafka.controller.QuorumController)
kafka_1             | [2022-12-06 15:23:04,721] ERROR [Controller 1] maybeBalancePartitionLeaders: unable to start processing because of TimeoutException. (org.apache.kafka.controller.QuorumController)

It is probably related to VM performance, which is slower during the boot stage.

So I run docker compose restart and it works again. I finally fixed it by adding restart: always; the second start attempt succeeds.

I have not seen any problems so far, but I have not tested under high load.
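
For reference, a minimal sketch of the relevant compose stanza (service name and image tag are illustrative):

    services:
      kafka:
        image: bitnami/kafka:3.3.1
        restart: always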

sekrett
  • In our case, nodes fail while in the running state; I didn't see many issues with booting. Another problem is with the Bitnami image we use, which doesn't exit when the state changes to shutdown. Containers would be restarted by our orchestration layer if they exited. – donnie Dec 06 '22 at 17:03