
From the documentation and other references, it seems the connector will be instantiated with a single task no matter what value is set for the `tasks.max` property.

  1. Does `tasks.max` have any effect, for example on failover? Say `tasks.max` is set to 2 and a JDBC connector runs with a single task: if that task fails, will another one take over?
  2. What is the significance of distributed mode in this case, given that the connector is effectively created with a single task?
OneCricketeer
Nag

1 Answer


For the source connector, as linked, this is because it uses a single Change Stream cursor. With more than one task, how would you expect the tasks to avoid conflicting with each other, for example by reading the same data and writing duplicates to the topic?

Connect runs sources and sinks. Many sources support only a single task, but that depends on their internal threading model. For example, a connector can run one task per collection or table, but if there is only one unified item to read, such as a change stream or binlog, then there can only be one task. You've mentioned JDBC; however, Debezium would be preferred for CDC if it supports your database.
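To illustrate the point about work units, here is a minimal Python sketch of how a connector might split its work into tasks. It is not the actual Connect API; `assign_task_configs` and the unit names are hypothetical, and the only assumption is that a connector never creates more tasks than it has independent units to read.

```python
def assign_task_configs(work_units, max_tasks):
    """Spread independent work units (hypothetical: tables, collections,
    or a single change stream) across at most max_tasks tasks."""
    if not work_units:
        return []
    # A task with nothing to read is pointless, so the number of tasks
    # is capped by the number of independent units, not only by max_tasks.
    num_tasks = min(len(work_units), max_tasks)
    groups = [[] for _ in range(num_tasks)]
    for i, unit in enumerate(work_units):
        groups[i % num_tasks].append(unit)  # round-robin assignment
    return groups


# Three tables with a limit of two tasks: both tasks get work.
tables_plan = assign_task_configs(["orders", "customers", "payments"], max_tasks=2)
print(tables_plan)  # [['orders', 'payments'], ['customers']]

# One change stream with a limit of two tasks: still only one task.
stream_plan = assign_task_configs(["change-stream"], max_tasks=2)
print(stream_plan)  # [['change-stream']]
```

With a single unit the task limit is only an upper bound, so raising it changes nothing.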

Distribution is also for fault tolerance, not just scalability. Only some failures are recoverable, though, and only in those cases can a task be restarted on another node.

OneCricketeer
  • That makes sense. If a source is limited to a single task because multiple tasks would cause duplicates, applications are limited to that single task and may not get the benefits of a "true" distributed model. How is this handled in reality, in a production environment? Will JDBC connectors use a single task in most production scenarios? – Nag Jan 15 '22 at 18:21
  • Not sure I understand the question. It's the same limitation in any environment. Like I said, Debezium might be preferred in production, which does not use JDBC – OneCricketeer Jan 16 '22 at 00:30
  • If we want to parallelize and use it in a "true" distributed sense, how can we deploy this in prod if it is limited to a single task? – Nag Jan 16 '22 at 02:02
  • One database table cannot be parallelized. Multiple tables can be. You start multiple connectors, each with one task, across multiple Connect workers – OneCricketeer Jan 16 '22 at 05:09
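As a rough sketch of the last comment, the Python below builds one connector definition per table, each with `tasks.max` set to 1. The table names, the connection URL, the `id` column, and the `jdbc-` topic prefix are placeholders chosen for illustration; the config keys are those of the Confluent JDBC source connector. Each payload could then be submitted to the Connect REST API (`POST /connectors`), and a distributed cluster would place the resulting tasks across its workers.

```python
import json


def one_connector_per_table(tables, connection_url):
    """Build one single-task JDBC source connector definition per table."""
    payloads = []
    for table in tables:
        payloads.append({
            "name": f"jdbc-source-{table}",  # hypothetical naming scheme
            "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": connection_url,
                "table.whitelist": table,          # exactly one table per connector
                "mode": "incrementing",
                "incrementing.column.name": "id",  # assumed column name
                "topic.prefix": "jdbc-",           # assumed topic prefix
                "tasks.max": "1",
            },
        })
    return payloads


# Placeholder tables and database URL, for illustration only.
payloads = one_connector_per_table(
    ["orders", "customers"],
    "jdbc:postgresql://db.example.com:5432/shop",
)
for payload in payloads:
    print(json.dumps(payload, indent=2))
```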