I am trying to read from a Kafka topic in my Spark batch job and publish the records to another topic. I am not using Structured Streaming because it does not fit our use case. According to the Spark docs, a batch job starts reading from the earliest Kafka offsets by default, so every time I re-run the job it reads the whole topic from the beginning again. How do I make the job pick up from the offset where the previous run stopped?
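To make it concrete, here is a simplified version of what the job does (broker address and topic names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-batch-copy").getOrCreate()

// Batch read from Kafka. Without an explicit "startingOffsets" option,
// a batch query defaults to "earliest", so every run re-reads the topic.
val input = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()

// Publish the records to the output topic.
input
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic")
  .save()
```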
According to the Spark Kafka Integration docs, there are options to specify `startingOffsets` and `endingOffsets`. But how do I figure out which values to pass on each run?
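As far as I understand from the docs, these options take either a keyword or a per-partition JSON map, something like this (topic name, partition count, and offset values are made up):

```scala
// Per-partition offsets as JSON: -2 means "earliest", -1 means "latest".
// For a batch query, -1/"latest" is not allowed in startingOffsets,
// and -2/"earliest" is not allowed in endingOffsets.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .option("startingOffsets", """{"input-topic":{"0":100,"1":250,"2":-2}}""")
  .option("endingOffsets", "latest")
  .load()
```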
I am using the `spark.read.format("kafka")` API to read the data from Kafka as a Dataset, but I did not find any option on that Dataset to get the start and end offsets of what was actually read.
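The closest I have found is that each row returned by the Kafka source carries `partition` and `offset` columns, so in theory I could compute the highest offset read per partition myself and persist it between runs, but that feels like reimplementing checkpointing by hand. A sketch, reusing the `input` Dataset from the first snippet:

```scala
import org.apache.spark.sql.functions.{col, max}

// The Kafka source schema includes "partition" and "offset" columns,
// so the last offset read per partition can be aggregated manually.
val lastOffsets = input
  .groupBy(col("partition"))
  .agg(max(col("offset")).as("lastOffset"))

lastOffsets.show()
```

Is there a built-in way to do this, or is manually tracking offsets like the above the expected approach for batch queries?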