So I've been running Kafka 3.1.0 in a production environment. One of the VMs had to be live migrated, but due to some issues the live migration failed and the node was forcefully migrated instead, which involved a full VM restart.
After that VM booted back up, Kafka stopped working "completely": clients were not able to connect or produce/consume anything. JMX metrics were still being reported, but that node showed many partitions as "Offline partitions".
Looking into the logs, that particular node kept logging a lot of INCONSISTENT_TOPIC_ID errors. Example:
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-2. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-2. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
WARN [ReplicaFetcher replicaId=4, leaderId=2, fetcherId=0] Received INCONSISTENT_TOPIC_ID from the leader for partition my-topic-3. This error may be returned transiently when the partition is being created or deleted, but it is not expected to persist. (kafka.server.ReplicaFetcherThread)
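From what I understand, INCONSISTENT_TOPIC_ID means the topic ID this follower has recorded locally no longer matches the one the leader reports. Since topic IDs were introduced (Kafka 2.8), each partition directory carries a partition.metadata file, so the two can be compared by hand. This is just how I'd sanity-check it; /var/lib/kafka/data is an assumed log.dirs path, adjust it to whatever your server.properties says:

# Topic ID this broker has recorded locally (path assumes log.dirs=/var/lib/kafka/data)
cat /var/lib/kafka/data/my-topic-3/partition.metadata
# prints something like:
#   version: 0
#   topic_id: XXXXXXXXXXXXXXXXXXXXXX

# Topic ID the cluster currently reports for the topic
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic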
However, if you took a look at the other Kafka brokers, they were showing a slightly different error (I don't have a log sample): UNKNOWN_TOPIC_ID
...
Another interesting issue: I've described the Kafka topic and this is what I got:
Topic: my-topic TopicId: XXXXXXXXXXXXXXXXXXXXXX PartitionCount: 4 ReplicationFactor: 4 Configs: segment.bytes=214748364,unclean.leader.election.enable=true,retention.bytes=214748364
Topic: my-topic Partition: 0 Leader: 2 Replicas: 5,2,3,0 Isr: 2
Topic: my-topic Partition: 1 Leader: 0 Replicas: 0,1,2,3 Isr: 0
Topic: my-topic Partition: 2 Leader: 2 Replicas: 1,2,3,4 Isr: 2
Topic: my-topic Partition: 3 Leader: 2 Replicas: 2,3,4,5 Isr: 2
Why does the ISR contain only one replica when there should be four per partition? And why did this happen in the first place?
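For reference, a quick way to see how widespread the shrunken ISRs are is to list every partition whose ISR is smaller than its replica list (localhost:9092 is a stand-in for the real bootstrap address):

# Lists all partitions where ISR < replica set
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions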
I've added an additional partition and this is what it shows now:
Topic: my-topic TopicId: XXXXXXXXXXXXXXXXXXXXXX PartitionCount: 5 ReplicationFactor: 4 Configs: segment.bytes=214748364,unclean.leader.election.enable=true,retention.bytes=214748364
Topic: my-topic Partition: 0 Leader: 2 Replicas: 5,2,3,0 Isr: 2
Topic: my-topic Partition: 1 Leader: 0 Replicas: 0,1,2,3 Isr: 0
Topic: my-topic Partition: 2 Leader: 2 Replicas: 1,2,3,4 Isr: 2
Topic: my-topic Partition: 3 Leader: 2 Replicas: 2,3,4,5 Isr: 2
Topic: my-topic Partition: 4 Leader: 3 Replicas: 3,4,5,0 Isr: 3,4,5,0
I know there is the kafka-reassign-partitions.sh script, and it fixed a similar issue in a preproduction environment, but I am more interested in why it happened in the first place.
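For completeness, this is roughly what the preproduction fix looked like; the replica assignment below is illustrative, not the exact one I used:

# reassign.json -- replica lists here are illustrative
{
  "version": 1,
  "partitions": [
    { "topic": "my-topic", "partition": 0, "replicas": [5, 2, 3, 0] }
  ]
}

# Apply the reassignment, then check its progress
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --execute
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --verify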
Could this be related? I've set replica.lag.time.max.ms=5000 (the default is 30000), and even after restarting all the nodes it didn't help.
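In case the setting never actually took effect, the effective broker config can be checked like this (broker id 4 as an example; --all also lists values coming from server.properties and defaults):

kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers \
  --entity-name 4 --describe --all | grep replica.lag.time.max.ms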