
I want to process messages present in a Kafka topic using Kafka Streams.

The last step of the processing is to write the result to a database table. To avoid database-contention issues (the program will run 24/7 and process millions of messages), I will be batching the JDBC calls.

But in this case there is a possibility of messages getting lost. In one scenario, I read 500 messages from the topic and Streams marks the offsets, and then the program fails: the messages sitting in the pending JDBC batch update are lost, yet their offsets are already marked.
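For illustration, a minimal sketch of this pattern (topic name, table, connection URL, and batch size are placeholders, not taken from any real setup): the terminal step buffers records and flushes them as one JDBC batch, and the loss window sits between Streams committing offsets and `executeBatch()` completing.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class BatchingSinkSketch {
    static final int BATCH_SIZE = 500;
    static final List<String> buffer = new ArrayList<>();

    public static void main(String[] args) throws Exception {
        // Placeholder connection and table names for illustration only.
        Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/demo");
        PreparedStatement stmt = db.prepareStatement("INSERT INTO results(value) VALUES (?)");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream("input-topic");

        // Terminal step: buffer records, flush as one JDBC batch when full.
        // Hazard: Kafka Streams commits offsets on its own schedule, so a
        // crash here loses records that are buffered but not yet flushed.
        stream.foreach((key, value) -> {
            buffer.add(value);
            if (buffer.size() >= BATCH_SIZE) {
                try {
                    for (String v : buffer) {
                        stmt.setString(1, v);
                        stmt.addBatch();
                    }
                    stmt.executeBatch();
                    buffer.clear();
                } catch (java.sql.SQLException e) {
                    throw new RuntimeException(e);
                }
            }
        });
        // ... build the topology and start the KafkaStreams instance as usual ...
    }
}
```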

I want to manually mark the offset of the last message once the database insert/update is complete, but according to the following question that is not possible: How to commit manually with Kafka Stream?

Can someone please suggest a possible solution?

Jay Liya
  • Why not write the result into an output topic and use Kafka Connect to get the data into the DB? – Matthias J. Sax Nov 14 '19 at 02:49
  • Just wanted to confirm whether I can ensure the batch size (for example, an exact number of inserts/updates, say 5000) using Kafka Connect. I read the following question related to batching, and it suggests batching is done on a best-effort basis and is not always exact. I want to make sure I am interpreting it correctly. Question link: https://stackoverflow.com/questions/58552372/batch-size-in-kafka-jdbc-sink-connector – Jay Liya Nov 14 '19 at 07:07
  • I am not familiar with all the details -- in the end, I would assume that it depends on the concrete connector you pick. The other question is: why is it important that the batching is exact? – Matthias J. Sax Nov 14 '19 at 07:09
  • I want to reduce the number of database calls as much as possible. In my use case, messages will be added to my source topic 24/7. If I allow messages to be pushed without an exact batch size, I fear it may hamper database performance due to the sheer number of calls from this application. – Jay Liya Nov 14 '19 at 07:16
  • 2
  • I would not worry about this -- even if there is no guarantee about the batch size, I would assume that as long as the data rate is high, the batch will be filled up. Only for lower data rates might batches be smaller, but that also implies the calls to the DB come at larger intervals, due to the lower data rate. – Matthias J. Sax Nov 15 '19 at 03:17

2 Answers


As alluded to in @sun007's answer, I'd rather change your approach slightly:

  • Use Kafka Streams to process the input data. Let the Kafka Streams application write its output to Kafka, not to the relational database.
  • Use Kafka Connect (e.g., the ready-to-use JDBC sink connector) to then ingest the data from Kafka into the relational database. Configure and tune the connector as needed, e.g. for batch inserts into the database (a configuration sketch follows below).
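As a hedged example, a sink configuration along these lines (connection details, topic, and key column are placeholders; the property names below are from Confluent's kafka-connect-jdbc sink and may differ for other connectors or versions):

```properties
name=results-jdbc-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=processed-results
connection.url=jdbc:postgresql://localhost:5432/demo
connection.user=demo
connection.password=demo
# Upserts keep retries and replays idempotent.
insert.mode=upsert
pk.mode=record_key
pk.fields=id
auto.create=true
# Upper bound on records per batched INSERT; actual batches may be smaller.
batch.size=3000
```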

This decoupling of processing (Kafka Streams) and ingestion (Kafka Connect) is typically a preferable design. For example, you no longer couple the processing step to the availability of the database: why should your KStreams application stop if the DB is down? That's an operational concern that shouldn't matter to the processing logic, where you certainly don't want to deal with timeouts, retries, and so on. (Even if you used a tool other than Kafka Streams for the processing, this decoupling would still be a preferable setup.)

miguno
  • What if my destination is also a Kafka topic, but my processing enriches events with data from an external API? Trying to reduce the number of separate I/O operations seems pretty reasonable. – SerG Sep 23 '20 at 19:10
  • 1
  • See my answer https://stackoverflow.com/a/49771142/1743580. – miguno Sep 25 '20 at 09:09

Kafka Streams doesn't support manual commits, and it doesn't support batch processing either. For your use case there are a few possibilities:

  1. Use a plain Kafka consumer, implement the batching yourself, and commit offsets manually (see the sketch after this list).

  2. Use Spark Structured Streaming with Kafka, as described here: Kafka Spark Structured Stream

  3. Try Spring Kafka: Spring Kafka

  4. In this kind of scenario it is also worth considering the JDBC Kafka connector: Kafka JDBC Connector
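For option 1, a minimal sketch with a plain consumer (topic, table, group id, and connection details are placeholders): auto-commit is disabled, each polled batch is written to the database with one JDBC batch call, and offsets are committed only after that call succeeds. A crash between `executeBatch()` and `commitSync()` replays the batch, so the insert should be idempotent (e.g. an upsert).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitBatchSink {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "jdbc-batch-writer");
        props.put("enable.auto.commit", "false"); // commit only after the DB write
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/demo")) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            PreparedStatement stmt = db.prepareStatement("INSERT INTO results(value) VALUES (?)");

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                for (ConsumerRecord<String, String> record : records) {
                    stmt.setString(1, record.value());
                    stmt.addBatch();
                }
                stmt.executeBatch(); // one database round trip per poll

                // Commit offsets only after the database write succeeded, so a
                // crash replays the batch instead of losing it (at-least-once).
                consumer.commitSync();
            }
        }
    }
}
```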

Nitin