Recently, we had a production incident in which our Kafka consumers kept processing the same records over and over, and the consumer group was constantly rebalancing. I do not want to focus on that issue here - we resolved it (by lowering `max-poll-records`) and it has been working fine since.
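Just for context, the fix was essentially the following kind of setting. This is only a rough sketch against the plain Kafka client config (the hyphenated `max-poll-records` above maps to `max.poll.records`); the broker address, group id and the value 50 are placeholders, not our production configuration:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ConsumerTuning {
    // Sketch only - placeholder values, not our real settings.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // placeholder
        // Fewer records per poll() so that one batch is processed within
        // max.poll.interval.ms and the broker does not evict the consumer
        // from the group (which is what kept triggering the rebalances).
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);             // default: 500
        return props;
    }
}
```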
But the incident made me wonder - could we have lost some messages during this incident?
For instance: the documentation for `auto-offset-reset` says that this parameter applies "...if an offset is out of range". According to Kafka auto.offset.reset query, this may happen e.g. "if the Consumer offset is less than the smallest offset". That is, if we had `auto-offset-reset=latest` and topic cleanup was triggered during the incident, we could have lost all the unprocessed data in the topic (because the offset would have been reset to the end of the topic in that case). Therefore, IMO, it is never a good idea to use `auto-offset-reset=latest` if you need at-least-once delivery.
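To illustrate what I mean by at-least-once on the consumer side, here is a minimal sketch with the plain Kafka client (broker address, group id, topic name and the processing logic are placeholders): offsets are committed manually, and only after the records have actually been processed, with `auto.offset.reset=earliest` as the fallback.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // earliest: if the committed offset is gone (out of range), start from the
        // oldest retained record instead of jumping to the end of the topic.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // Commit manually, only after the records have been processed.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // business logic goes here
                }
                consumer.commitSync(); // commit only after successful processing
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```

With this pattern, a crash between processing and commit means some records are processed again after a restart - duplicates instead of losses, which is the at-least-once trade-off.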
Actually, there are plenty of other situations where there is a threat of data loss in Kafka if not everything is set up correctly. For instance:
- When the schema registry is not available, messages can get lost: How to avoid losing messages with Kafka streams
- After an application restart, unprocessed messages are skipped even though `auto-offset-reset=earliest` is set. We had this problem too, in one topic (not in every topic); perhaps it is the same out-of-range case as above (a sketch of a check for that is below this list).
- etc.
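Related to the out-of-range scenario, this is the kind of startup check I have in mind to at least detect such a loss (again just a sketch; broker address, group id and topic name are placeholders, and it assumes Kafka clients 2.4+ for `committed(Set)`): compare the committed offsets of the group with the beginning offsets of the topic; if a committed offset is below the beginning offset, the records in between were removed by retention/cleanup before they were ever processed.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetGapCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            Set<TopicPartition> partitions = new HashSet<>();
            consumer.partitionsFor("my-topic") // placeholder topic
                    .forEach(p -> partitions.add(new TopicPartition(p.topic(), p.partition())));

            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);
            Map<TopicPartition, Long> beginning = consumer.beginningOffsets(partitions);

            for (TopicPartition tp : partitions) {
                OffsetAndMetadata c = committed.get(tp);
                long begin = beginning.get(tp);
                if (c == null) {
                    System.out.printf("%s: no committed offset (auto.offset.reset will apply)%n", tp);
                } else if (c.offset() < begin) {
                    // Records between the committed offset and the current beginning
                    // offset were deleted by retention/cleanup before being processed.
                    System.out.printf("%s: committed=%d < beginning=%d -> %d records already gone%n",
                            tp, c.offset(), begin, begin - c.offset());
                } else {
                    System.out.printf("%s: committed=%d, beginning=%d -> OK%n", tp, c.offset(), begin);
                }
            }
        }
    }
}
```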
Is there a cookbook for how to configure everything related to Kafka properly, so that the application is robust (with respect to Kafka) and data loss is prevented? We have set up everything we consider important, but I'm not sure we haven't overlooked something - and I cannot foresee every possible failure mode in order to guard against it. For instance:
- We have Kafka consumers with the same groupId running in different (geographically separated) networks. Does that matter? Nowadays probably not, but in the past it apparently did, according to this answer.