
I have been trying to read data out of MySQL with Kafka Connect, using a MySQL source connector that queries the database and the Debezium connector that reads the binlogs. I am trying to understand which is the better way to pull change data. Binlogs have the overhead of writing to log files, while reading from the database has the overhead of repeatedly querying it. What are the other major advantages and disadvantages associated with each approach? What would be a better way of capturing change data? Also, starting from MySQL 8 the binlogs are enabled by default. Does this mean it could be the better way of doing things?

Jonathan Hall
pacman

2 Answers


This question can be summarized as follows:

What are the pros and cons of a log-based CDC (represented by Debezium Connector) versus a polling-based CDC (represented by JDBC Source Connector)?

Query-based CDC:

  • ✓ Usually easier to set up
  • ✓ Requires fewer permissions
  • ✗ Impact of polling the DB
  • ✗ Needs specific columns in source schema to track changes
  • ✗ Can't track deletes
  • ✗ Can't track multiple changes to the same row within one polling interval (only the latest state is seen)

Log-based CDC:

  • ✓ All data changes are captured
  • ✓ Low event latency without the increased CPU load of frequent polling
  • ✓ No impact on data model
  • ✓ Can capture deletes
  • ✓ Can capture the old record state and additional metadata
  • ✗ More setup steps
  • ✗ Higher system privileges required (see the sketch after this list)
  • ✗ Can be expensive for some proprietary DBs
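
To make a couple of these bullets concrete, here is a rough SQL sketch (table, column, and user names are hypothetical). A query-based connector effectively polls with a query keyed on tracking columns, which is why it needs those columns and never sees deletes or intermediate updates; a log-based connector such as Debezium instead needs replication-level privileges:

    -- Query-based CDC: the connector repeatedly runs something like this,
    -- driven by an incrementing id and/or an updated_at timestamp column.
    -- Deleted rows simply stop matching, and a row changed twice between
    -- polls is only seen in its latest state.
    SELECT *
    FROM orders                              -- hypothetical table
    WHERE updated_at > '2021-01-08 05:00:00' -- offset from the previous poll
    ORDER BY updated_at;

    -- Log-based CDC (Debezium for MySQL): the connector's account needs
    -- replication-related grants on top of SELECT (assumes the user exists):
    GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
        ON *.* TO 'debezium_user'@'%';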

Reference:

  1. Five Advantages of Log-Based Change Data Capture by Gunnar Morling
  2. No More Silos: How to Integrate Your Databases with Apache Kafka and CDC by Robin Moffatt
  3. StackOverflow: Kafka Connect JDBC vs Debezium CDC
Iskuskov Alexander
  • Can you give some names of proprietary DBs for which logs are expensive? Also, I came to understand that from MySQL 8 binlogs are enabled by default. Does this mean that binlogs are becoming less heavy in at least some aspects? – pacman Jan 08 '21 at 05:23
  • Also, for query-based CDC, what do you mean when you say it can't track multiple events between polling intervals? – pacman Jan 08 '21 at 05:36

The list given by @Iskuskov Alexander is great. I'd add a few more points:

  • Log-based CDC also requires writing to logs (you mentioned this in your question). This has overhead not only for performance, but also storage space.

  • Log-based CDC requires a continuous stream of logs. If the CDC misses a log, then the replica cannot be kept in sync, and the whole replica must be replaced by a new replica initialized by a new snapshot of the database.

  • If your CDC process is offline periodically, you need to retain the logs until it catches up, and it can be hard to predict how long that will be. This leads to needing more storage space (see the retention sketch below).
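
If binlog storage is a concern, you can see what the server is keeping and cap retention. A small sketch, using MySQL 8.0 variable names (5.7 uses a different retention variable):

    -- List the binlog files currently kept on disk and their sizes
    SHOW BINARY LOGS;

    -- MySQL 8.0: purge binlogs automatically after 3 days (259200 seconds).
    -- Pick a window longer than any expected CDC outage, otherwise the
    -- connector will need a fresh snapshot when it comes back online.
    SET GLOBAL binlog_expire_logs_seconds = 259200;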

That said, query-based CDC has its own drawbacks. At my company, we have used query-based CDC, but we found it inconvenient, and we're working on replacing it with a Debezium log-based solution, for many of the reasons in the other answer and also:

  • Query-based CDC makes it hard to keep schema changes in sync with the replica, so if a schema change occurs in the source database, it may require the replica be trashed and replaced with a fresh snapshot.

  • The replica is frequently in a "rebuilding" state for hours, when it needs to be reinitialized from a snapshot, and users don't like this downtime. Also snapshot transfers increase the network bandwidth requirements.

Neither solution is "better" than the other. Both have pros and cons. Your job as an engineer is to select the option that fits your project's requirements the best. In other words, choose the one whose disadvantages are least bad for your needs.

We can't make that choice for you, because you know your project better than we do.


Re your comments:

Enabling binary logs has no overhead for read queries, but significant overhead for write queries. The overhead became greater in MySQL 8.0, as measured by Percona CTO Vadim Tkachenko and reported here: https://www.percona.com/blog/2018/05/04/how-binary-logs-affect-mysql-8-0-performance/

He concludes the overhead of binary logs is about 13% for MySQL 5.7, and up to 30% for MySQL 8.0.
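
If you want to check what your own server is doing before worrying about that overhead, a quick sanity check might look like this (a sketch; Debezium additionally expects row-based logging with full row images):

    -- Is binary logging enabled? (ON by default since MySQL 8.0)
    SHOW VARIABLES LIKE 'log_bin';

    -- Settings Debezium cares about
    SHOW VARIABLES LIKE 'binlog_format';     -- ROW is expected
    SHOW VARIABLES LIKE 'binlog_row_image';  -- FULL is expected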

Can you also explain "The replica is frequently in a "rebuilding" state for hours, when it needs to be reinitialized from a snapshot"? Do you mean building a replication database?

Yes, if you need to build a new replica, you acquire a snapshot of the source database and import it to the replica. Every step of this takes time:

  1. Create the snapshot of the source
  2. Transfer the snapshot to the host where the replica lives
  3. Import the snapshot into the replica instance

How long this takes depends on the size of the database, but it can be hours or even days. While waiting for this, users can't use the replica database, at least not if they want their queries to analyze a complete copy of the source data. They have to wait for the import to finish.
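
One detail worth adding: whichever tool you use for step 1, the snapshot is only useful for (re)building a replica if you also record the binlog coordinates it corresponds to, so that replication or the CDC connector can resume from exactly that point. A classic manual sketch (assumes a brief read lock is acceptable; tools such as mysqldump can record the coordinates for you):

    -- On the source, briefly block writes so the coordinates match the snapshot
    FLUSH TABLES WITH READ LOCK;
    SHOW MASTER STATUS;   -- note the binlog file name and position

    -- ... take the snapshot here (mysqldump, filesystem copy, etc.) ...

    UNLOCK TABLES;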

Bill Karwin
  • Thanks for the beautiful explanation, Bill. I just want to know one more thing, if you have ever experienced it: what is the level of effect of binlogs on the throughput and latency of the database? Will it slow down the whole database? Storage is one aspect, but performance is something that's worrying me. – pacman Jan 08 '21 at 04:46
  • Can you also explain "The replica is frequently in a "rebuilding" state for hours, when it needs to be reinitialized from a snapshot"? Do you mean building a replication database? – pacman Jan 08 '21 at 05:39