
When a Kafka topic has multiple consumers (from different consumer groups), who is responsible for keeping track of the offsets, i.e. the data already read per (consumer group, topic, partition)? Is it Kafka, or do the consumers have to implement their own logic to track what they have read? For example, when Spark reads data from Kafka, it maintains a checkpoint that tracks the data read.

steve

1 Answer


Kafka has an internal topic, `__consumer_offsets`, to which consumers commit their offsets when they use a consumer group.
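To make the bookkeeping concrete, here is a minimal sketch of the idea in plain Python. It is a toy model only, not Kafka's implementation or client API: the `OffsetStore` class and the group and topic names are hypothetical, and the only point it illustrates is that a committed offset is kept per (consumer group, topic, partition), so different groups reading the same topic do not interfere with each other.

```python
class OffsetStore:
    """Toy stand-in for the __consumer_offsets topic (hypothetical, for illustration)."""

    def __init__(self):
        # Maps (group, topic, partition) -> last committed offset.
        self._committed = {}

    def commit(self, group, topic, partition, offset):
        # A later commit for the same key overwrites the earlier one.
        self._committed[(group, topic, partition)] = offset

    def committed(self, group, topic, partition):
        # Returns None when the group has never committed for this partition.
        return self._committed.get((group, topic, partition))


store = OffsetStore()

# Two independent consumer groups read partition 0 of the same topic.
store.commit("billing", "orders", 0, 42)
store.commit("analytics", "orders", 0, 7)

billing_offset = store.committed("billing", "orders", 0)
analytics_offset = store.committed("analytics", "orders", 0)
unknown_offset = store.committed("reporting", "orders", 0)
```

Each group resumes from its own committed position, and a group that has never committed has no stored position at all.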

OneCricketeer
  • When Spark reads (pulls) from the Kafka topic, who is responsible for tracking the offsets/data already read for each consumer? Is it Kafka, using the `__consumer_offsets` topic, or Spark, using its checkpoint? – steve Aug 02 '23 at 12:33
  • Until around Spark 3, there was no configuration option for `kafka.group.id`, but I recall someone telling me that it isn't used to track state, maybe only consumer lag. In that case, it should be the checkpoint files. https://stackoverflow.com/questions/64003405/how-to-use-kafka-group-id-and-checkpoints-in-spark-3-0-structured-streaming-to-c – OneCricketeer Aug 02 '23 at 13:17
  • Looking at https://stackoverflow.com/questions/64003405/how-to-use-kafka-group-id-and-checkpoints-in-spark-3-0-structured-streaming-to-c, it seems Spark is tracking the offsets. In that case, what is the significance of `__consumer_offsets` in a Spark-Kafka setup? – steve Aug 02 '23 at 15:51
  • Kafka always has that topic, so it's unrelated to Spark. The Spark Streaming docs describe [how to store offsets back in Kafka, or use an "external store" like Zookeeper](https://spark.apache.org/docs/3.4.1/streaming-kafka-0-10-integration.html#storing-offsets), but for Structured Streaming there is a new setting in Spark 3.1 to use "deprecated offset fetching": https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#offset-fetching – OneCricketeer Aug 02 '23 at 17:14
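
As a rough illustration of the Structured Streaming side of this thread, the sketch below shows where the checkpoint location is configured in PySpark. It assumes a `SparkSession` with the Kafka connector available is passed in; the helper names `kafka_source_options` and `start_query`, the broker address, the topic name, and the checkpoint path are all hypothetical. The option keys (`kafka.bootstrap.servers`, `subscribe`, `startingOffsets`, `checkpointLocation`) are the ones documented for the Kafka source.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build the reader options for the Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Per the Spark docs, this applies only when a query starts fresh;
        # a resumed query picks up from where it left off.
        "startingOffsets": "earliest",
    }


def start_query(spark, options, checkpoint_dir):
    """Start a streaming query that records its progress under checkpoint_dir."""
    reader = spark.readStream.format("kafka")
    for key, value in options.items():
        reader = reader.option(key, value)
    df = reader.load()
    return (
        df.writeStream.format("console")
        # Spark keeps the query's progress, including offsets, in this directory.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )


options = kafka_source_options("localhost:9092", "orders")
```

With a running session, `start_query(spark, options, "/tmp/checkpoints/orders")` would start the stream; restarting it with the same checkpoint directory is what lets Spark resume from its own recorded position.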