Version
I'm using Kafka 2.8.1 (the latest 2.x release at the time of writing).
Background
I have a topic ingress with 64 partitions, 3x replication, and 8 brokers. I expanded the cluster to 12 brokers following the Expand your Cluster documentation. I do not like to use the --generate option for kafka-reassign-partitions.sh because it does not attempt to minimize data movement. I therefore created a new assignment manually, moving replicas to the 4 new brokers, adjusting preferred leaders, and making sure each broker has 16 replicas. I split the reassignment JSON into 16 parts so I can control the replica movement and not move the world all at once. This process is best practice (see docs here and here).
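For readers who want to reproduce the batching step, here is a minimal Python sketch of splitting a plan into 16 parts and checking the per-broker replica count. The helper names split_plan and replicas_per_broker are hypothetical, and the placement rule used to build the sample plan is made up for illustration; it is not the assignment I actually used.

```python
from collections import Counter


def split_plan(plan, parts):
    """Split a reassignment plan into `parts` smaller plans of near-equal size."""
    entries = sorted(plan["partitions"], key=lambda e: e["partition"])
    size, extra = divmod(len(entries), parts)
    batches, start = [], 0
    for i in range(parts):
        # The first `extra` batches take one more entry when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        batches.append({"version": plan["version"], "partitions": entries[start:end]})
        start = end
    return batches


def replicas_per_broker(plan):
    """Count how many replicas each broker holds under the plan."""
    return Counter(b for e in plan["partitions"] for b in e["replicas"])


# Hypothetical 64-partition plan over brokers 0..11 (illustrative placement only).
plan = {
    "version": 1,
    "partitions": [
        {
            "topic": "ingress",
            "partition": p,
            "replicas": [p % 12, (p + 4) % 12, (p + 8) % 12],
            "log_dirs": ["any", "any", "any"],
        }
        for p in range(64)
    ],
}

batches = split_plan(plan, 16)
counts = replicas_per_broker(plan)
print(len(batches), sorted(set(counts.values())))
```

Each batch can then be written to its own file with json.dump and passed to --reassignment-json-file one at a time.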
Mistake
However, I made a mistake with the first reassignment and cancelled it with the --cancel option for kafka-reassign-partitions.sh. On --execute, the same script gives you a JSON assignment that undoes the reassignment, for use as a rollback (see example at end). I did not use this to roll back the cancelled reassignment. I corrected my JSON files and proceeded to reassign, as I wanted, all 192 replicas. The docs here imply this should have corrected it:
If such processes are not stopped, the effect of cancelling all pending reassignments will be negated anyway, by the creation of new reassignments.
The Problem
The cancelled reassignment incorrectly started moving a partition 3 replica to broker 8, and even after the full reassignment of partition 3 completed, a partial "orphaned" replica remains on broker 8. Here are the directory sizes:
> kubectl exec kafka-8 -c kafka -- du -h /var/lib/kafka/data/topics
616G /var/lib/kafka/data/topics/ingress-28
615G /var/lib/kafka/data/topics/ingress-40
618G /var/lib/kafka/data/topics/ingress-8
615G /var/lib/kafka/data/topics/ingress-48
613G /var/lib/kafka/data/topics/ingress-0
617G /var/lib/kafka/data/topics/ingress-24
617G /var/lib/kafka/data/topics/ingress-36
615G /var/lib/kafka/data/topics/ingress-60
617G /var/lib/kafka/data/topics/ingress-52
617G /var/lib/kafka/data/topics/ingress-12
615G /var/lib/kafka/data/topics/ingress-4
616G /var/lib/kafka/data/topics/ingress-32
616G /var/lib/kafka/data/topics/ingress-20
469G /var/lib/kafka/data/topics/ingress-3 // <--- the orphaned partial replica.
617G /var/lib/kafka/data/topics/ingress-56
617G /var/lib/kafka/data/topics/ingress-44
617G /var/lib/kafka/data/topics/ingress-16
11T /var/lib/kafka/data/topics
Broker 8 does not show up in the list of replicas for partition 3:
Topic: ingress Partition: 3 Leader: 4 Replicas: 4,6,11 Isr: 11,6,4
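The comparison I did by eye (directories on the broker versus the Replicas list) can be sketched in Python as below. This is only an illustration: parse_describe and orphaned_dirs are hypothetical helpers, the partition 8 line is invented so there is a healthy directory to compare against, and the describe format is assumed to match the line above.

```python
import re


def parse_describe(lines):
    """Map partition number -> replica broker ids from topic describe lines."""
    assigned = {}
    pattern = re.compile(r"Partition:\s*(\d+)\s.*?Replicas:\s*([\d,]+)")
    for line in lines:
        m = pattern.search(line)
        if m:
            assigned[int(m.group(1))] = [int(b) for b in m.group(2).split(",")]
    return assigned


def orphaned_dirs(broker_id, topic, dir_names, assigned):
    """Return partition directories on `broker_id` that it is not assigned to."""
    orphans = []
    for name in dir_names:
        prefix, _, part = name.rpartition("-")
        if prefix != topic or not part.isdigit():
            continue  # not a partition directory of this topic
        replicas = assigned.get(int(part))
        # Partitions with no known assignment are skipped rather than flagged.
        if replicas is not None and broker_id not in replicas:
            orphans.append(name)
    return orphans


describe_lines = [
    "Topic: ingress Partition: 3 Leader: 4 Replicas: 4,6,11 Isr: 11,6,4",
    # Hypothetical line for a partition that broker 8 legitimately hosts.
    "Topic: ingress Partition: 8 Leader: 8 Replicas: 8,0,4 Isr: 8,0,4",
]
assigned = parse_describe(describe_lines)
orphans = orphaned_dirs(8, "ingress", ["ingress-3", "ingress-8"], assigned)
print(orphans)
```

This only detects the leftover directory; it does not remove anything from the broker.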
The Question
What is the way to remove this orphaned replica? Ideally without manually deleting it from the volume or manually editing ZooKeeper nodes.
It would seem I cannot do it via kafka-reassign-partitions.sh because I've already asked Kafka to move the replicas of partition 3 to brokers 11, 6, and 4, not broker 8.
This replica is not being kept up to date with new writes, but it does show in the LogEndOffset metrics, so Kafka at some level is "aware" of this orphaned partition 3 replica.
Partition 3 assignments
{
  "topic": "ingress",
  "partition": 3,
  "replicas": [
    11,
    6,
    4
  ],
  "log_dirs": [
    "any",
    "any",
    "any"
  ]
}
Related questions
There are several similar questions that hint at this problem, but they are old and target versions of Kafka from before the Admin API. They therefore recommend manually editing ZooKeeper or files on disk, which is undesirable for this production cluster.
- most similar, but unanswered: Delete unused Kafka partition
- Kafka 0.10.0.1 partition reassignment after broker failure
- How to remove an inconsistent kafka topic metadata data from kafka_2.10-0.8.1.1
- Re-assignment of Partition gets infinitely stuck in "Still in progress" state
Rollback json example
Current partition replica assignment
{"version":1,"partitions":[{"topic":"ingress","partition":16,"replicas":[1,5,8],"log_dirs":["any","any","any"]},{"topic":"ingress","partition":17,"replicas":[2,6,9],"log_dirs":["any","any","any"]},{"topic":"ingress","partition":18,"replicas":[3,7,10],"log_dirs":["any","any","any"]},{"topic":"ingress","partition":19,"replicas":[4,0,11],"log_dirs":["any","any","any"]}]}
Save this to use as the --reassignment-json-file option during rollback
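Since the rollback JSON is printed between two fixed lines of the --execute output, it can be captured automatically instead of copied by hand. The following Python sketch assumes the output looks like the example above; extract_rollback is a hypothetical helper, and the sample text is a shortened copy with line breaks placed the way a terminal might wrap it.

```python
import json

START = "Current partition replica assignment"
END = "Save this to use as the --reassignment-json-file option during rollback"


def extract_rollback(output):
    """Return the rollback plan printed between the START and END lines."""
    begin = output.index(START) + len(START)
    finish = output.index(END, begin)
    # Copied terminal output can contain line breaks inside the JSON, so
    # remove them before parsing; spaces elsewhere are left untouched.
    text = output[begin:finish].replace("\r", "").replace("\n", "")
    return json.loads(text)


# Shortened sample of the --execute output, wrapped mid-JSON on purpose.
sample = """Current partition replica assignment

{"version":1,"partitions":[{"topic":"ingress","partition":16,"replicas":[
1,5,8],"log_dirs":["any","any","any"]},{"topic":"ingress","partition":19
,"replicas":[4,0,11],"log_dirs":["any","any","any"]}]}

Save this to use as the --reassignment-json-file option during rollback
"""

rollback = extract_rollback(sample)
print(json.dumps(rollback))
```

Saving the result next to each batch file keeps a rollback plan available for every step of the reassignment.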