I have a Kafka Streams application (Kafka v1.1.0) with multiple (24) topics. Four of these are source topics and the rest are destination topics. The application seems to have reprocessed data when the system time was changed to a previous date. I am using the default configs (broker and consumer), i.e.:
auto.offset.reset = latest
offsets.retention.minutes = 1440 # 1 day
log.retention.hours = 168 # 7 days
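For reference, here is a minimal sketch of how my Streams app is wired up (the application id, bootstrap address, and topic names are placeholders, not my real ones). Note that application.id is also the consumer group id under which the offsets are stored:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MyStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // application.id doubles as the group id in __consumer_offsets
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Passed through to the embedded consumer; only takes effect when
        // no committed offset exists for the group (e.g. after offset expiry)
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "latest");

        StreamsBuilder builder = new StreamsBuilder();
        // One of the four source topics feeding a destination topic
        builder.stream("source-topic-1").to("destination-topic-1");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```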
I have looked into the following links in detail, as well as the sub-links posted in their answers:
1) Kafka Stream reprocessing old messages on rebalancing
2) How does an offset expire for an Apache Kafka consumer group?
The following JIRA issue also discusses this problem:
https://issues.apache.org/jira/browse/KAFKA-3806
After reading up on these, I have an understanding of the cases in which stream consumers might reprocess data.
However, with the default configs mentioned above (the ones used in my setup), if offsets are lost, i.e. offsets.retention.minutes has expired, then on a rebalance the consumer would find no committed offset and, because auto.offset.reset = latest, would start from the log end; only new incoming data would be processed. In this scenario there shouldn't be any reprocessing and hence no duplicates.
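As a sanity check on that understanding, the committed offset for the group can be queried with a plain consumer (group id, topic, and partition below are placeholders): if committed() returns null and auto.offset.reset = latest, the group would resume from the log-end offset.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Same group id as the Streams app's application.id (placeholder)
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-streams-app");
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("source-topic-1", 0);
            // Returns null once the offset has expired out of
            // __consumer_offsets (offsets.retention.minutes)
            OffsetAndMetadata committed = consumer.committed(tp);
            // With no committed offset and auto.offset.reset=latest, the
            // group would resume from this log-end offset
            Long logEnd = consumer.endOffsets(Collections.singleton(tp)).get(tp);
            System.out.println("committed=" + committed + ", log-end=" + logEnd);
        }
    }
}
```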
In the case of a system time change, however, the committed offsets might become inconsistent, i.e. it is possible for an offset commit on a source topic to be written with a CommitTime of an earlier date after a commit with a CommitTime of a later date. In this case, if a topic has low traffic and receives no data for more than offsets.retention.minutes, its committed offset would no longer be available, while another high-traffic topic would still have its offset in the __consumer_offsets topic.
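For completeness, the CommitTime values I am referring to are the ones printed when dumping the __consumer_offsets topic with the console consumer's offsets formatter, roughly like this (broker address is a placeholder):

```
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic __consumer_offsets \
  --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" \
  --consumer-property exclude.internal.topics=false \
  --from-beginning
```

Each record prints as something like [group,topic,partition]::[OffsetMetadata[offset,metadata],CommitTime <ts>,ExpirationTime <ts>], which is where the out-of-order CommitTime values show up after the clock change.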
How would the stream consumer behave in this scenario? Is there a chance of duplication here? I am really confused about this; any help would be much appreciated.