Reliable fire-n-forget Kafka producer implementation strategy

Question

I'm in the middle of a first-mile problem with Kafka. Everybody deals with partitioning and so on, but how do you handle the first mile?

My system consists of many event-producing applications distributed across nodes. I need to deliver these events to a set of consumer applications in a reliable, fail-safe way. The messaging system of choice is Kafka (due to its log nature), but that's not set in stone.

The events should be propagated in a decoupled, fire-n-forget manner as much as possible. This means the producers should be fully responsible for reliably delivering their messages, and the apps producing events shouldn't have to worry about event delivery at all.

The producer's reliability scheme has to account for:

box connection outage - during an outage the producer can't access the network at all, so the Kafka cluster is not reachable
box restart - both the producer and the event-producing app restart (independently); the producer should persist in-flight messages (those being retried, batched, etc.)
internal Kafka exceptions - message size too large, serialization exception, etc.

No library I've examined so far covers these cases. Is there a suggested strategy for solving this?

I know there are retriable and non-retriable errors during the Producer's send(). For the retriable ones, the library usually handles everything internally. However, a non-retriable error ends with an exception in the async callback...

Should I blindly replay these forever? For network outages that should work, but what about Kafka internal errors, say a message that is too large? There might be a DeadLetterQueue-like mechanism plus replay. However, how do I deal with the message count then...

About the persistence: a lightweight DB backend should solve this. Just create a persistent queue and remove the messages that have already been sent/ACKed. However, I'm afraid that if it were this simple, it would have been implemented in the standard Kafka libraries a long time ago. Performance would probably go south.

Seeing things like KAFKA-3686 or KAFKA-1955 makes me a bit worried.

Thanks in advance.

Amit Kumar · Answer 1 · 2017-05-31T17:21:10.270

We have a production system whose primary use case is reliable message delivery. I can't go into much detail, but I can share a high-level design of how we achieve this. Note that this system guarantees "at-least-once delivery" messaging semantics.

Source

First we designed a message schema, and every message sent to this system must follow it.
Then we write the message to a MySQL message table, which is sharded by date, with a field marking whether the message has been delivered.
We have an app constantly polling the DB for rows marked undelivered. It picks up a row, constructs the message and sends it to the load balancer. This is a blocking call, and the app updates the message row to delivered only when a 200 is returned. In case of a 5xx, the app retries the message with a sleep back-off. You can also make the retries configurable as per your needs.

Each source system maintains its own polling app and DB.
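
The Source steps above can be sketched roughly as follows. This is only an illustration, not our production code: it swaps MySQL for SQLite from the Python standard library so that it runs on its own, and the names outbox, enqueue, drain and post are hypothetical. Setting a row aside as 'rejected' on a 4xx is an assumption; the description above only covers the 200 and 5xx cases.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending'
)
"""  # status is one of: pending, delivered, rejected


def open_outbox(path):
    """Open (or create) the local message table."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    db.commit()
    return db


def enqueue(db, payload):
    """Called by the event-producing app: persist first, deliver later."""
    with db:
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def drain(db, post, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Deliver pending rows in insertion order.

    `post(payload)` is a blocking call that returns an HTTP status code.
    Returns False if a row is still failing after max_attempts.
    """
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        for attempt in range(max_attempts):
            try:
                status = post(payload)
            except OSError:
                status = 503  # network outage: treat like a 5xx and retry
            if 200 <= status < 300:
                new_status = "delivered"
            elif 400 <= status < 500:
                new_status = "rejected"  # assumption: retrying will not help
            else:
                sleep(base_delay * 2 ** attempt)  # exponential sleep back-off
                continue
            with db:
                db.execute(
                    "UPDATE outbox SET status = ? WHERE id = ?",
                    (new_status, row_id),
                )
            break
        else:
            return False  # row stays pending and is picked up on the next poll
    return True
```

Because a row is marked delivered only after the blocking call returns a 2xx, a crash between the send and the update leads to a resend after restart, which is where the at-least-once semantics (and possible duplicates) come from.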

Producer Array

This is basically an array of machines under a load balancer, waiting for incoming messages and producing them to the Kafka cluster.
We maintain 3 replicas of each topic, and in the producer config we keep acks = -1, which is very important for your fire-n-forget requirement. As per the docs:

acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting
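
As a rough illustration, the relevant settings could look like the fragment below, written as plain Python dictionaries keyed by Kafka property names. The broker host names are hypothetical, and min.insync.replicas is an assumption that is not part of the setup described here.

```python
# Producer properties, using the Kafka client property names.
producer_config = {
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",  # hypothetical hosts
    "acks": "all",  # equivalent to -1: wait for the full set of in-sync replicas
}

# Topic-level setting (an assumption, not mentioned above): with 3 replicas,
# requiring at least 2 in-sync replicas makes an acks=all write fail instead
# of being accepted when only a single replica is left in sync.
topic_config = {
    "min.insync.replicas": "2",
}
```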

As I said, producing is a blocking call. It returns 2xx if the message is produced successfully across all 3 replicas, 4xx if the message doesn't meet the schema requirements, and 5xx if the Kafka broker threw an exception.
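
A minimal sketch of that status mapping, assuming a hypothetical handle_produce function, a hypothetical set of required schema fields, and a produce callable that blocks until the cluster has acknowledged the write and raises on failure:

```python
def handle_produce(message, produce, required_fields=("id", "type", "payload")):
    """Return an HTTP-style status code for one incoming message.

    `produce` and `required_fields` are placeholders for illustration;
    the real schema and producer call are not shown in this answer.
    """
    if not isinstance(message, dict) or any(
        field not in message for field in required_fields
    ):
        return 400  # schema violation: the caller should not retry
    try:
        produce(message)  # blocks until the write is acknowledged
    except Exception:
        return 500  # produce failed: the caller may retry
    return 200
```

Note that catching every exception as 500 means a non-retriable failure (for example, an oversized record) would be retried by the caller until its retry limit is reached; mapping such errors to a 4xx instead is one possible refinement.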

Consumer Array

This is a normal array of machines running Kafka high-level consumers for the topic's consumer groups.

We are currently running this setup in production, with a few additional components for some other functional flows, and it is basically fire-n-forget from the source's point of view.

This system addresses all of your concerns.

box connection outage: Unless the source polling app gets a 2xx, it will keep producing the message again, which may lead to duplicates.
box restart: Thanks to the retry mechanism of the source, this shouldn't be a problem either.
internal Kafka exceptions: Taken care of by the polling app, as the producer array will reply with a 5xx when it is unable to produce, and the message will be retried.

acks = -1 also ensures that all in-sync replicas have a copy of the message, so a broker going down will not be an issue either.

Thanks for the reply. However, I can see a weak point in the MySQL part. What happens to the events when the source MySQL dies, has to be maintained, upgraded, etc.? — Yuri, Jun 01 '17 at 06:58
@Yuri, MySQL does need to be maintained; however, a master-slave setup should be good enough... and of all data stores, MySQL requires the least maintenance. — Amit Kumar, Jun 01 '17 at 07:15
