I'm currently trying to design a scalable consumer architecture for Kafka and I'm running into some issues with offset coordination. It is important for my use case that each message consumed from Kafka is processed exactly once.
Take the following as an illustration of the problem:
- Consumer retrieves message from Kafka
- Consumer processes message (business logic, yay!)
- Consumer finishes processing, increments local offset
- Consumer attempts to commit the offset back to Kafka
- This network call fails for whatever reason
- The above error, or anything else, causes the consumer to crash before the offset commit can be retried
- The system orchestrator brings up a replacement consumer, which then fetches the stale committed offset
- The same message is retrieved and re-processed (bad!) — see the sketch after this list
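For concreteness, here's a minimal sketch of the loop I have in mind (plain Java kafka-clients consumer with auto-commit disabled); the topic, group id, and `processRecord` are placeholders, not my real code. The crash window sits between `processRecord()` returning and `commitSync()` succeeding:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-consumer-group");   // placeholder group id
        props.put("enable.auto.commit", "false");     // commit manually, after processing
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    processRecord(record);             // business logic (side effects happen here)
                }
                // If the process crashes anywhere between processRecord() above and a
                // successful commitSync() below, the replacement consumer re-reads the
                // same records from the last committed offset.
                consumer.commitSync();
            }
        }
    }

    private static void processRecord(ConsumerRecord<String, String> record) {
        // placeholder for the actual business logic
        System.out.printf("processing offset %d: %s%n", record.offset(), record.value());
    }
}
```

As far as I can tell, this is at-least-once by construction: the commit is a separate, fallible step that only happens after the side effects have already occurred.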
Those of you with more distributed-systems experience than me have probably recognized that this is (more or less) the Two Generals problem applied to coordinating Kafka offsets with work results.
I've thought about committing the offset and the work result in the same (probably SQL) database transaction (sketched below), but that couples those implementations together and limits my data-store options: what do I do if I later move to a store without transactions? Another possible solution would be hashing each message and using a Bloom filter to probabilistically skip duplicates, but that adds complexity (and a false-positive risk) I'd prefer to avoid.
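For reference, here is a rough sketch of that first idea, which I believe is essentially the "store offsets outside Kafka" pattern: the result and the next offset are written in the same SQL transaction, and on partition assignment the consumer seeks to the offset recorded in the database rather than Kafka's committed offset. The JDBC URL, the `results` and `kafka_offsets` tables, and the Postgres-style upsert are all assumptions on my part, not a working implementation:

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

public class DbOffsetConsumer {
    public static void main(String[] args) throws SQLException {
        // Placeholder DSN and schema; 'results' and 'kafka_offsets' are hypothetical tables.
        Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app");
        db.setAutoCommit(false);

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "db-offset-group");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("my-topic"), new ConsumerRebalanceListener() {
            @Override public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
            @Override public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Resume from the offsets stored in the DB, making the DB the source of
                // truth instead of Kafka's committed offsets.
                for (TopicPartition tp : partitions) {
                    consumer.seek(tp, loadOffset(db, tp));
                }
            }
        });

        while (true) {
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                // Result and offset become durable atomically: either both land or neither
                // does, so a crash can't leave a processed record with an unrecorded offset.
                try (PreparedStatement res = db.prepareStatement(
                         "INSERT INTO results (topic, kafka_partition, msg_offset, payload) VALUES (?, ?, ?, ?)");
                     PreparedStatement off = db.prepareStatement(
                         "INSERT INTO kafka_offsets (topic, kafka_partition, next_offset) VALUES (?, ?, ?) " +
                         "ON CONFLICT (topic, kafka_partition) DO UPDATE SET next_offset = EXCLUDED.next_offset")) {
                    res.setString(1, r.topic());
                    res.setInt(2, r.partition());
                    res.setLong(3, r.offset());
                    res.setString(4, r.value());
                    res.executeUpdate();
                    off.setString(1, r.topic());
                    off.setInt(2, r.partition());
                    off.setLong(3, r.offset() + 1);   // next offset to read after this record
                    off.executeUpdate();
                }
                db.commit();
            }
        }
    }

    static long loadOffset(Connection db, TopicPartition tp) {
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT next_offset FROM kafka_offsets WHERE topic = ? AND kafka_partition = ?")) {
            ps.setString(1, tp.topic());
            ps.setInt(2, tp.partition());
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;   // never seen this partition: start at 0
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The downside is exactly the coupling I mentioned: the offset bookkeeping now lives in the same store as the results, and the whole approach assumes that store supports transactions.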