8

We've been using Kafka Connect for a while on a project, currently entirely using only the Confluent Kafka Connect JDBC connector. I'm struggling to understand the role of 'tasks' in Kafka Connect, and specifically with this connector. I understand 'connectors'; they encompass a bunch of configuration about a particular source/sink and the topics they connect from/to. I understand that there's a 1:Many relationship between connectors and tasks, and the general principle that tasks are used to parallelize work. However, how can we understand when a connector will/might create multiple tasks?

  • In the source connector case, we are using the JDBC connector to pick up source data by timestamp and/or a primary key, and so this seems in its very nature sequential. Indeed, all of our source connectors only ever seem to have one task. What would ever trigger Kafka Connect to create more than one connector? Currently we are running Kafka Connect in distributed mode, but only with one worker; if we had multiple workers, might we get multiple tasks per connector, or are the two not related?

  • In the sink connector case, we are explicitly configuring each of our sink connectors with tasks.max=1, and so unsurprisingly we only ever see one task for each connector there too. If we removed that configuration, presumably we could/would get more than one task. Would this mean the messages on our input topic might be consumed out of sequence? In which case, how is data consistency for changes assured?

Also, from time to time, we have seen situations where a single connector and task will both enter the FAILED state (because of input connectivity issues). Restarting the task will remove it from this state, and restart the flow of data, but the connector remains in FAILED state. How can this be - isn't the connector's state just the aggregate of all its child tasks?

Andrew Ferrier
  • 16,664
  • 13
  • 47
  • 76

1 Answers1

16

A task is a thread that performs the actual sourcing or sinking of data.

The number of tasks per connector is determined by the implementation of the connector. Take a Debezium source connector to MySQL as an example, since one MySQL instance writes to exactly one binlog file at a time and a file has to be read sequentially, one connector generates exactly one task.

Whereas for sink connectors, the number of tasks should be equal to the number of partitions of the topic.

The task distribution among workers is determined by task rebalance which is a very similar process to Kafka consumer group rebalance.

Wenli Wan
  • 605
  • 5
  • 8
  • Thanks, that's helpful, and roughly what I suspected. Do you know how this might affect the question re: FAILED status and it not being reflected from the task into the connector? Is that likely to be implementation-specific too? – Andrew Ferrier May 10 '21 at 10:32
  • 1
    Yes, I guess it is very likely implementation-specific. From my past experiences, the Debezium connector is able to reflect _most_ errors in `/status` API, but `UNKNOWN_TOPIC_OR_PARTITION` error is an exception. – Wenli Wan May 11 '21 at 09:45