Spark Offset Management in Kafka

Question

I am using Spark Structured Streaming (version 2.3.2). I need to read from a Kafka cluster and write to a Kerberized Kafka cluster. I want to use Kafka itself for offset checkpointing, so that an offset is committed only after the record has been written to the Kerberized Kafka.

Questions:

Can we use Kafka for checkpointing to manage offsets, or do we have to use HDFS/S3?

Please help.

Does this answer your question? [How to manually set group.id and commit kafka offsets in spark structured streaming?](https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-in-spark-structured-stream) — Michael Heil, Sep 30 '20 at 07:47
I want to commit the offset in source Kafka after the write is done on the sink kafka, till then i don't want to commit the offset. — Siva Samraj, Sep 30 '20 at 07:56
You need to work with the framework, not against it. As @mike states as well. — thebluephantom, Sep 30 '20 at 09:46

score 1 · Accepted Answer · answered Sep 30 '20 at 10:14

> Can we use Kafka for checkpointing to manage offset

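No, you cannot commit offsets back to your source Kafka topic. Structured Streaming tracks the consumed offsets itself in its checkpoint rather than committing them to Kafka. This is described in detail here and of course in the official Spark Structured Streaming + Kafka Integration Guide.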

> or do we need to use only HDFS/S3?

Yes, this has to be something like HDFS or S3. This is explained in the section "Recovering from Failures with Checkpointing" of the Structured Streaming Programming Guide: "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."
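
For illustration, here is a minimal PySpark sketch of a Kafka-to-Kafka query where the offsets are kept in a checkpoint directory rather than in Kafka. The broker addresses, topic names, checkpoint path, and Kerberos settings are all hypothetical placeholders, and the JAAS/keytab setup a Kerberized cluster needs is not shown. Treat it as a rough outline of where `checkpointLocation` goes, not as a tested configuration for your clusters.

```python
def source_options(bootstrap_servers, topic):
    """Options for the Kafka source (hypothetical broker and topic)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Only used on the very first run; afterwards the checkpoint decides.
        "startingOffsets": "earliest",
    }


def sink_options(bootstrap_servers, topic, checkpoint_dir):
    """Options for the Kafka sink plus the checkpoint location."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        # Assumed Kerberos client settings; adjust to your cluster.
        "kafka.security.protocol": "SASL_PLAINTEXT",
        "kafka.sasl.kerberos.service.name": "kafka",
        # Must be a path on an HDFS-compatible file system (HDFS, S3, ...).
        "checkpointLocation": checkpoint_dir,
    }


def start_query(spark, src, sink):
    """Start the streaming query; `spark` is an existing SparkSession."""
    events = spark.readStream.format("kafka").options(**src).load()
    return (
        events.selectExpr("key", "value")  # pass key/value through unchanged
        .writeStream.format("kafka")
        .options(**sink)
        .start()
    )
```

A call could look like `start_query(spark, source_options("src-broker:9092", "input-topic"), sink_options("dst-broker:9092", "output-topic", "hdfs:///checkpoints/kafka-to-kafka"))`. On restart with the same checkpoint directory, the query resumes from the offsets recorded there, which is the framework's replacement for committing offsets to the source Kafka.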
