When a Kafka topic has multiple consumers (from different consumer groups), who keeps track of the offsets already read (consumer group, topic, partition)? Is it Kafka, or do consumers have to implement their own logic to track what they have read? For example, when Spark reads data from Kafka, it maintains a checkpoint to track what has been read.
1 Answer
Kafka has an internal topic `__consumer_offsets` that consumers commit their offsets to when using a consumer group.
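Not Spark-specific, but as a minimal sketch of what that means: a plain consumer with a `group.id` commits its offsets to that internal topic, so the broker, not the client, remembers the group's position. The broker address, topic name, and group id below are assumptions.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

object OffsetCommitSketch extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // assumed broker address
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group")        // hypothetical group id
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")          // commit explicitly below

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List("my-topic").asJava)                           // hypothetical topic name

  val records = consumer.poll(Duration.ofSeconds(1))
  records.asScala.foreach(r => println(s"${r.partition}:${r.offset} -> ${r.value}"))

  // Stores (group, topic, partition) -> offset in the broker-side __consumer_offsets topic,
  // so the group resumes from here on restart; no client-side bookkeeping is needed.
  consumer.commitSync()

  consumer.close()
}
```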

– OneCricketeer
- When reading a Kafka topic from Spark (pull), who is responsible for tracking the offsets/data already read for each consumer: Kafka via the __consumer_offsets topic, or Spark via its checkpoint? – steve Aug 02 '23 at 12:33
- Until around Spark 3, there was no configuration option for `kafka.group.id`, but I recall being told that it wasn't used to track state, maybe only consumer lag. In that case, it should be checkpoint files (see the sketch after these comments). https://stackoverflow.com/questions/64003405/how-to-use-kafka-group-id-and-checkpoints-in-spark-3-0-structured-streaming-to-c – OneCricketeer Aug 02 '23 at 13:17
- Looking at https://stackoverflow.com/questions/64003405/how-to-use-kafka-group-id-and-checkpoints-in-spark-3-0-structured-streaming-to-c, it seems Spark is tracking the offsets. In that case, what is the significance of __consumer_offsets in a Spark-Kafka setup? – steve Aug 02 '23 at 15:51
- Kafka always has that topic, so it's unrelated to Spark. For the older Spark Streaming (DStreams) API, the docs describe [how to store offsets back in Kafka, or use an "external store" like ZooKeeper](https://spark.apache.org/docs/3.4.1/streaming-kafka-0-10-integration.html#storing-offsets), but for Structured Streaming there is a new setting in Spark 3.1 to use "deprecated offset fetching" - https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#offset-fetching – OneCricketeer Aug 02 '23 at 17:14
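A minimal Structured Streaming sketch of what the comments describe, with the broker address, topic name, and checkpoint path made up: the query resumes from whatever is recorded under its `checkpointLocation`, not from `__consumer_offsets`.

```scala
import org.apache.spark.sql.SparkSession

object SparkKafkaCheckpointSketch extends App {
  val spark = SparkSession.builder()
    .appName("kafka-offset-tracking-sketch")
    .master("local[*]")
    .getOrCreate()

  // Structured Streaming Kafka source: Spark itself records the Kafka offsets it has read
  // in the checkpoint (the offsets/ and commits/ subdirectories), not in __consumer_offsets.
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker address
    .option("subscribe", "my-topic")                       // hypothetical topic name
    .option("startingOffsets", "earliest")                 // only consulted when no checkpoint exists yet
    .load()

  val query = df.selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/kafka-offset-checkpoint")  // hypothetical path where progress is stored
    .start()

  query.awaitTermination()
}
```

Deleting the checkpoint directory makes `startingOffsets` take effect again on the next run, which is why the checkpoint, rather than `__consumer_offsets`, determines where a Structured Streaming query resumes.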