Recently, we had a production incident in which our Kafka consumers kept processing the same records over and over, and the consumer group was constantly rebalancing. I do not want to focus on that issue here - we resolved it (by lowering `max-poll-records`) and it has been working fine since.
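Just for context, the fix was essentially the following kind of setting. This is only a rough sketch against the plain Kafka client config (the hyphenated `max-poll-records` above maps to `max.poll.records`); the broker address, group id and the value 50 are placeholders, not our production configuration:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ConsumerTuning {
    // Sketch only - placeholder values, not our real settings.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // placeholder
        // Fewer records per poll() so that one batch is processed within
        // max.poll.interval.ms and the broker does not evict the consumer
        // from the group (which is what kept triggering the rebalances).
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);             // default: 500
        return props;
    }
}
```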
But the incident made me wonder - could we have lost some messages during this incident?
For instance: the documentation for `auto-offset-reset` says that this parameter applies "...if an offset is out of range". According to Kafka auto.offset.reset query, this may happen e.g. "if the Consumer offset is less than the smallest offset". That is, if we had `auto-offset-reset=latest` and topic cleanup was triggered during the incident, we could have lost all the unprocessed data in the topic (because the offset would have been reset to the end of the topic in that case). Therefore, IMO, it is never a good idea to use `auto-offset-reset=latest` if you need at-least-once delivery.
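To illustrate what I mean by at-least-once on the consumer side, here is a minimal sketch with the plain Kafka client (broker address, group id, topic name and the processing logic are placeholders): offsets are committed manually, and only after the records have actually been processed, with `auto.offset.reset=earliest` as the fallback.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // earliest: if the committed offset is gone (out of range), start from the
        // oldest retained record instead of jumping to the end of the topic.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // Commit manually, only after the records have been processed.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // business logic goes here
                }
                consumer.commitSync(); // commit only after successful processing
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```

With this pattern, a crash between processing and commit means some records are processed again after a restart - duplicates instead of losses, which is the at-least-once trade-off.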
Actually, there are plenty of other situations where there is a threat of data loss in Kafka if not everything is set up correctly. For instance:
- When the schema registry is not available, messages can get lost: How to avoid losing messages with Kafka streams
- After an application restart, unprocessed messages are skipped even though `auto-offset-reset=earliest` is set. We had this problem too, in one topic (not in every topic); perhaps it is the same out-of-range case as above (a sketch of a check for that is below this list).
- etc.
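Related to the out-of-range scenario, this is the kind of startup check I have in mind to at least detect such a loss (again just a sketch; broker address, group id and topic name are placeholders, and it assumes Kafka clients 2.4+ for `committed(Set)`): compare the committed offsets of the group with the beginning offsets of the topic; if a committed offset is below the beginning offset, the records in between were removed by retention/cleanup before they were ever processed.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetGapCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            Set<TopicPartition> partitions = new HashSet<>();
            consumer.partitionsFor("my-topic") // placeholder topic
                    .forEach(p -> partitions.add(new TopicPartition(p.topic(), p.partition())));

            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);
            Map<TopicPartition, Long> beginning = consumer.beginningOffsets(partitions);

            for (TopicPartition tp : partitions) {
                OffsetAndMetadata c = committed.get(tp);
                long begin = beginning.get(tp);
                if (c == null) {
                    System.out.printf("%s: no committed offset (auto.offset.reset will apply)%n", tp);
                } else if (c.offset() < begin) {
                    // Records between the committed offset and the current beginning
                    // offset were deleted by retention/cleanup before being processed.
                    System.out.printf("%s: committed=%d < beginning=%d -> %d records already gone%n",
                            tp, c.offset(), begin, begin - c.offset());
                } else {
                    System.out.printf("%s: committed=%d, beginning=%d -> OK%n", tp, c.offset(), begin);
                }
            }
        }
    }
}
```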
Is there a cookbook for how to configure everything related to Kafka properly, so that the application is robust (with respect to Kafka) and data loss is prevented? We have set up everything we consider important, but I'm not sure we haven't overlooked something - and I cannot foresee every possible failure mode in order to guard against it. For instance:
- We have Kafka consumers with the same groupId running in different (geographically separated) networks. Does that matter? Nowadays probably not, but in the past it apparently did, according to this answer.