I have a simple Spark Streaming application which reads data from Kafka and then, after a transformation, sends the data to an HTTP endpoint (or another Kafka topic; for this question let's consider HTTP). I am submitting jobs using job-server.
I currently start consuming from the source Kafka with "auto.offset.reset"="smallest" and a batch interval of 3 seconds. In the happy case everything looks good. Here's an excerpt:
kafkaInputDStream.foreachRDD(rdd => {
  rdd.foreach(item => {
    // This will throw an exception if the HTTP endpoint isn't reachable
    httpProcessor.process(item._1, item._2)
  })
})
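For reference, the stream setup looks roughly like this (broker addresses and topic name below are placeholders, and I'm sketching the 0.8 direct-stream API here):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 3-second batch interval
val ssc = new StreamingContext(sparkConf, Seconds(3))

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092", // placeholder brokers
  "auto.offset.reset"    -> "smallest"                   // start from earliest offset
)

val kafkaInputDStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("input-topic")) // placeholder topic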
Since "auto.offset.reset"="smallest", this processes about 200K messages in one job. If I stop the HTTP server mid-job (simulating an issue with POSTing) and httpProcessor.process throws an exception, that job fails and whatever is unprocessed is lost. I can see it keeps polling every 3 seconds after that.
So my questions are:
- Is my assumption right that if a job in the next 3-second window got X messages and only Y could be processed before hitting an error, the remaining X-Y will not be processed?
- Is there a way to pause the stream/consumption from Kafka? For instance, when there's an intermittent network issue, most likely all messages consumed during that window will be lost. Something that keeps retrying (maybe with exponential backoff) and starts consuming again whenever the HTTP endpoint is UP (see the sketch after this list for the behavior I mean).
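To make the second question concrete, here is a minimal sketch of the behavior I'm hoping for, not code I have today. retryWithBackoff is a hypothetical helper; the idea is that blocking inside the task holds the current batch, so later batches queue up instead of being consumed and failed:

// Hypothetical helper: keep retrying op, doubling the wait up to a cap.
def retryWithBackoff[T](initialMs: Long = 1000L, maxMs: Long = 60000L)(op: => T): T = {
  var backoff = initialMs
  while (true) {
    try {
      return op
    } catch {
      case _: Exception =>
        Thread.sleep(backoff)                  // wait before retrying
        backoff = math.min(backoff * 2, maxMs) // exponential backoff, capped
    }
  }
  throw new IllegalStateException("unreachable")
}

kafkaInputDStream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    partition.foreach(item => {
      // Block here until the HTTP endpoint is reachable again
      retryWithBackoff() { httpProcessor.process(item._1, item._2) }
    })
  })
})

The catch-all on Exception is just for the sketch; in reality I'd only retry on connection-type failures. I'm not sure this is the right way to "pause" consumption, which is why I'm asking.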
Thanks