I'm in middle of a 1st mile problem with Kafka. Everybody deals with partitioning, etc. but how to handle the 1st mile?
My system consists of many applications producing events distributed on nodes. I need to deliver these events to a set of applications acting as consumers in a reliable/fail-safe way. The messaging system of choice is Kafka (due its log nature) but it's not set in stone.
The events should be propagated in a decoupled fire-n-forget manner as most as possible. This means the producers should be fully responsible for reliable delivering their messages. This means apps producing events shouldn't worry about the event delivery at all.
Producer's reliability schema has to account for:
- box connection outage - during an outage producer can't access network at all; Kafka cluster is thus not reachable
- box restart - both producer and event producing app restart (independently); producer should persist in-flight messages (during retrying, batching, etc.)
- internal Kafka exceptions - message size was too large; serialization exception; etc.
No library I've examined so far covers these cases. Is there a suggested strategy how to solve this?
I know there are retriable and non-retriable errors during Producer's send()
. On those retriable, the library usually handles everything internally. However, non-retriable ends with an exception in async callback...
Should I blindly replay these to infinity? For network outages it should work but how about Kafka internal errors - say message too large. There might be a DeadLetterQueue-like mechanism + replay. However, how to deal with message count...
About the persistence - a lightweight DB backend should solve this. Just creating a persistent queue and then removing those already send/ACKed. However, I'm afraid that if it was this simple it would be already implemented in standard Kafka libraries long time ago. Performance would probably go south.
Seeing things like KAFKA-3686 or KAFKA-1955 makes me a bit worried.
Thanks in advance.