Version

I'm using Kafka 2.8.1 (the latest 2.x release at the time of writing).

Background

I have a topic ingress with 64 partitions, 3x replication, and 8 brokers. I expanded the cluster to 12 brokers following the Expand your Cluster documentation. I do not like to use the --generate option of kafka-reassign-partitions.sh because it does not attempt to minimize data movement, so I manually created a new assignment that moves replicas to the 4 new brokers, adjusts preferred leaders, and gives each broker 16 replicas. I split the reassignment JSON into 16 parts so I can control the replica moves instead of moving everything at once; a sketch of one such slice is shown below. This process is considered best practice (see docs here and here).
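
For illustration, here is a minimal sketch of what one of the 16 slices and its kafka-reassign-partitions.sh invocation could look like; the file name, bootstrap server, and the specific partition-to-broker choices below are placeholders, not my actual assignment.

# part-01.json -- hypothetical slice of the manual assignment, moving two
# partitions onto the new brokers (the replica lists are illustrative only).
cat > part-01.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "ingress", "partition": 0, "replicas": [9, 1, 5], "log_dirs": ["any", "any", "any"]},
    {"topic": "ingress", "partition": 1, "replicas": [10, 2, 6], "log_dirs": ["any", "any", "any"]}
  ]
}
EOF

# Execute this slice, then wait for --verify to report completion before
# starting the next slice (the bootstrap server address is a placeholder).
kafka-reassign-partitions.sh --bootstrap-server kafka-0:9092 \
  --reassignment-json-file part-01.json --execute
kafka-reassign-partitions.sh --bootstrap-server kafka-0:9092 \
  --reassignment-json-file part-01.json --verify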

Mistake

However, I made a mistake in the first reassignment and cancelled it with the --cancel option of kafka-reassign-partitions.sh. On --execute, the same script prints a JSON assignment you can use to undo the reassignment as a rollback (see the example at the end); I did not use that to roll back the cancelled reassignment either. I corrected my JSON files and proceeded to reassign, as intended, all 192 replicas. The docs here imply this should have corrected it:

If such processes are not stopped, the effect of cancelling all pending reassignments will be negated anyway, by the creation of new reassignments.
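
For reference, the cancellation itself was roughly the following (file name and bootstrap server are placeholders); in my case it left a partially-copied replica behind on broker 8.

# Cancel the in-flight (mistaken) reassignment for the partitions in this file.
kafka-reassign-partitions.sh --bootstrap-server kafka-0:9092 \
  --reassignment-json-file part-01-mistake.json --cancel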

The Problem

The cancelled reassignment incorrectly moved a partition 3 replica to broker 8, and even after completing the full reassignment for partition 3, a partial "orphaned" replica remains on broker 8. See the directory sizes:

> kubectl exec kafka-8 -c kafka -- du -h /var/lib/kafka/data/topics
616G    /var/lib/kafka/data/topics/ingress-28
615G    /var/lib/kafka/data/topics/ingress-40
618G    /var/lib/kafka/data/topics/ingress-8
615G    /var/lib/kafka/data/topics/ingress-48
613G    /var/lib/kafka/data/topics/ingress-0
617G    /var/lib/kafka/data/topics/ingress-24
617G    /var/lib/kafka/data/topics/ingress-36
615G    /var/lib/kafka/data/topics/ingress-60
617G    /var/lib/kafka/data/topics/ingress-52
617G    /var/lib/kafka/data/topics/ingress-12
615G    /var/lib/kafka/data/topics/ingress-4
616G    /var/lib/kafka/data/topics/ingress-32
616G    /var/lib/kafka/data/topics/ingress-20
469G    /var/lib/kafka/data/topics/ingress-3 // <--- the orphaned partial replica. 
617G    /var/lib/kafka/data/topics/ingress-56
617G    /var/lib/kafka/data/topics/ingress-44
617G    /var/lib/kafka/data/topics/ingress-16
11T     /var/lib/kafka/data/topics

It does not show up in the list of replicas:

Topic: ingress  Partition: 3    Leader: 4       Replicas: 4,6,11        Isr: 11,6,4
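
That line comes from describing the topic, roughly as follows (the bootstrap server address is a placeholder); broker 8 does not appear anywhere in partition 3's replica list or ISR.

# Describe the topic; partition 3 lists only brokers 11, 6, and 4.
kafka-topics.sh --bootstrap-server kafka-0:9092 --describe --topic ingress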

The Question

What is the way to remove this orphaned replica, ideally without manually deleting it from the volume and without manually editing ZooKeeper nodes?

It would seem I cannot do it via kafka-reassign-partitions.sh, because I have already asked Kafka to move the replicas of partition 3 to brokers 11, 6, and 4, not to broker 8.

This replica is not being kept up to date with new writes, but it does show up in the LogEndOffset metrics, so Kafka is at some level "aware" of this orphaned partition 3 replica.
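
For completeness, the metric in question is the per-partition log mbean on broker 8; a query along these lines shows it (the JMX port and the use of JmxTool here are assumptions about the setup, not commands from my actual investigation).

# Ask broker 8's JMX endpoint for the orphaned partition's LogEndOffset.
# Port 9999 is an assumed JMX port; adjust to the broker's actual configuration.
kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://kafka-8:9999/jmxrmi \
  --object-name 'kafka.log:type=Log,name=LogEndOffset,topic=ingress,partition=3' \
  --one-time true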

Partition 3 assignment

{
  "topic": "ingress",
  "partition": 3,
  "replicas": [11, 6, 4],
  "log_dirs": ["any", "any", "any"]
}

Related questions

There are several similar questions that hint at this problem, but they are old and target versions of Kafka before the Admin API, so they recommend manually editing ZooKeeper or files on disk, which is undesirable for this production cluster.

Rollback JSON example

Current partition replica assignment

{"version":1,"partitions":[{"topic":"ingress","partition":16,"replicas":[1,5,8],"log_dirs":["any","any","any"]},{"topic":"ingress","partition":17,"replicas":[
2,6,9],"log_dirs":["any","any","any"]},{"topic":"ingress","partition":18,"replicas":[3,7,10],"log_dirs":["any","any","any"]},{"topic":"ingress","partition":19
,"replicas":[4,0,11],"log_dirs":["any","any","any"]}]}

Save this to use as the --reassignment-json-file option during rollback
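
Applying that saved assignment is just another --execute run, along these lines (file name and bootstrap server are placeholders).

# Re-apply the pre-reassignment replica assignment that the tool printed.
kafka-reassign-partitions.sh --bootstrap-server kafka-0:9092 \
  --reassignment-json-file rollback.json --execute
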
    I believe you do need to manually `rm` from disk. Kafka/ZK do not "know" about the partition anymore, so it is a fine operation... Also [related tooling for partition movement](https://github.com/DataDog/kafka-kit/tree/master/cmd/topicmappr) – OneCricketeer Aug 17 '22 at 21:26

1 Answer

I corrected the issue in two ways:

  1. The first, as mentioned by @OneCricketeer and in the comments of the similar question, is to simply rm -r (optionally with -f) the extra partition replica directory on the broker, since Kafka is no longer aware of it; see the sketch after this item.

So far I haven't noticed any issue with Kafka or ZooKeeper, and the metrics for the orphaned replica are gone as well.

This was by far the fastest approach, but I was apprehensive about doing it in production.
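
A sketch of that clean-up, using the pod and path from the du output in the question; the describe check is just a sanity step I'd recommend first, and the bootstrap server address is a placeholder.

# Sanity check: confirm broker 8 really is absent from partition 3's replica list.
kafka-topics.sh --bootstrap-server kafka-0:9092 --describe --topic ingress | grep 'Partition: 3'

# Remove the orphaned replica directory on the broker that still holds it.
kubectl exec kafka-8 -c kafka -- rm -rf /var/lib/kafka/data/topics/ingress-3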

  2. The second option, which I also did, was to add the broker back to the partition's replica assignment, execute it with kafka-reassign-partitions.sh, and wait for the reassignment to "take over" the orphaned replica on the broker. Once it finished, I removed the replica from the assignment again and watched Kafka delete the data in the directory; a sketch of the two steps follows after this item.

This option only used the available Kafka tools, but at a noticeable cost in waiting time and data movement, especially since the orphan had fallen far behind the in-sync replicas: it has to catch up fully, just to be deleted.
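
Sketched out, the two steps looked roughly like this. The exact replica ordering and file names are assumptions for illustration (as is the bootstrap server); the important part is that broker 8 appears in the first assignment and not in the second.

# Step 1: add broker 8 back so the orphaned replica becomes a real, tracked replica.
cat > step1-add-broker-8.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "ingress", "partition": 3, "replicas": [11, 6, 4, 8], "log_dirs": ["any", "any", "any", "any"]}
  ]
}
EOF
kafka-reassign-partitions.sh --bootstrap-server kafka-0:9092 \
  --reassignment-json-file step1-add-broker-8.json --execute
# ...wait until --verify reports completion (broker 8 must fully catch up)...

# Step 2: remove broker 8 again; Kafka then deletes the replica's data on broker 8.
cat > step2-remove-broker-8.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "ingress", "partition": 3, "replicas": [11, 6, 4], "log_dirs": ["any", "any", "any"]}
  ]
}
EOF
kafka-reassign-partitions.sh --bootstrap-server kafka-0:9092 \
  --reassignment-json-file step2-remove-broker-8.json --execute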

Finally, I will definitely try kafka-kit next time, thanks to @OneCricketeer and the Confluent Kafka community Slack.
