We've been using Kafka Connect for a while on a project, currently entirely using only the Confluent Kafka Connect JDBC connector. I'm struggling to understand the role of 'tasks' in Kafka Connect, and specifically with this connector. I understand 'connectors'; they encompass a bunch of configuration about a particular source/sink and the topics they connect from/to. I understand that there's a 1:Many
relationship between connectors and tasks, and the general principle that tasks are used to parallelize work. However, how can we understand when a connector will/might create multiple tasks?
In the source connector case, we are using the JDBC connector to pick up source data by timestamp and/or a primary key, and so this seems in its very nature sequential. Indeed, all of our source connectors only ever seem to have one task. What would ever trigger Kafka Connect to create more than one connector? Currently we are running Kafka Connect in distributed mode, but only with one worker; if we had multiple workers, might we get multiple tasks per connector, or are the two not related?
In the sink connector case, we are explicitly configuring each of our sink connectors with
tasks.max=1
, and so unsurprisingly we only ever see one task for each connector there too. If we removed that configuration, presumably we could/would get more than one task. Would this mean the messages on our input topic might be consumed out of sequence? In which case, how is data consistency for changes assured?
Also, from time to time, we have seen situations where a single connector and task will both enter the FAILED state (because of input connectivity issues). Restarting the task will remove it from this state, and restart the flow of data, but the connector remains in FAILED state. How can this be - isn't the connector's state just the aggregate of all its child tasks?