31

I am looking to productionize and deploy my Kafka Connect application. However, I have two questions about the tasks.max setting, which is required and of high importance, but the details of what to actually set this value to are vague.

If I have a topic with n partitions that I wish to consume data from and write to some sink (in my case, I am writing to S3), what should I set tasks.max to? Should I set it to n? Should I set it to 2n? Intuitively it seems that I'd want to set the value to n, and that's what I've been doing.
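For context, my connector config looks roughly like this (connector name, topic, and bucket are placeholders, and I'm using the Confluent S3 sink connector; "6" stands in for n):

```json
{
  "name": "my-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "tasks.max": "6",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```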

What if I change my Kafka topic and increase the number of partitions on it? If I set tasks.max to n, will I have to pause my connector and increase tasks.max? If I set it to 2n, will my connector automatically increase its parallelism?

arghtype
  • 4,376
  • 11
  • 45
  • 60
PhillipAMann
  • 887
  • 1
  • 10
  • 19

2 Answers

45

In a Kafka Connect sink, the tasks are essentially consumer threads and receive partitions to read from. If you have 10 partitions and have tasks.max set to 5, each task will receive 2 partitions to read from and will track the offsets. If you have configured tasks.max to a number above the partition count, Connect will launch a number of tasks equal to the number of partitions of the topics it's reading.
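The relationship above can be sketched as follows. This is a simplified illustration, not Connect's actual assignment code (the consumer group protocol does the real work); it just shows the effective task count being capped at the partition count and partitions spread across tasks:

```python
def assign_partitions(num_partitions, tasks_max):
    """Sketch of sink-task parallelism: the effective number of tasks
    is capped at the partition count, and partitions are spread
    round-robin across the tasks."""
    num_tasks = min(tasks_max, num_partitions)
    assignment = {task: [] for task in range(num_tasks)}
    for partition in range(num_partitions):
        assignment[partition % num_tasks].append(partition)
    return assignment

# 10 partitions with tasks.max=5: each task reads 2 partitions
print(assign_partitions(10, 5))
# 3 partitions with tasks.max=10: only 3 tasks do useful work
print(assign_partitions(3, 10))
```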

If you change the partition count of the topic, you'll have to relaunch your connector. If tasks.max is still greater than the new partition count, Connect will again start one task per partition.
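With the Connect REST API, "relaunching" typically means updating the connector's config, since a PUT to /connectors/{name}/config triggers a reconfiguration and task rebalance. A sketch, with the host, connector name, and config values as placeholders:

```shell
# Bump tasks.max after increasing the topic's partition count
curl -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/my-s3-sink/config \
  -d '{
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",
        "tasks.max": "12"
      }'
```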

Edit: I just discovered ConnectorContext: https://kafka.apache.org/0100/javadoc/org/apache/kafka/connect/connector/ConnectorContext.html

A connector has to be written to use this, but it looks like Connect has the ability to reconfigure a connector if there's a topic change (partitions added or removed).

Chris Matta
  • 3,263
  • 3
  • 35
  • 48
  • 1
What if we are running Kafka Connect in distributed mode with 2 machines? How are consumer tasks (threads) spawned and associated with worker machines if there are more/fewer tasks than partitions? – Vivek Pratap Singh Jan 04 '19 at 13:59
  • 2
    @Vivek, Distributed mode will distribute the tasks as evenly as possible on all running instances. – Chris Matta Jan 10 '19 at 15:57
3

We had a problem with the distribution of the workload between our Kafka Connect (5.1.2) instances, caused by setting tasks.max higher than the number of partitions.

In our case, there were 10 Kafka Connect tasks and 3 partitions on the topic to be consumed. 3 of those 10 tasks were assigned to the 3 partitions of the topic and the other 7 were not assigned to any partition (which is expected), but Kafka Connect distributed the tasks evenly across instances without considering their workload. So we ended up with a task distribution across our instances where some instances stayed idle (because all of their tasks were unassigned) while others worked more than the rest.

To work around the issue, we set tasks.max equal to the number of partitions of our topics.

It was really unexpected for us to see that Kafka Connect does not consider the tasks' partition assignments while rebalancing. Also, I couldn't find any documentation for the tasks.max setting.

Mert Tunç
  • 31
  • 2