
I have a worker in the primary region (US-East) that computes data on traffic at our edge locations. I want to push the data from an edge region to our primary Kafka region.

An example is Poland, Australia, and US-West. I want to push all these stats to US-East. I don't want to incur additional latency during the writes from the edge regions to the primary.

Another option is to create another Kafka cluster and a worker that acts as a relay. That would require us to maintain individual clusters in each region and would add a lot more complexity to our deployments.

I've seen MirrorMaker, but I don't really want to mirror anything; I guess I'm looking more for a relay system. If this isn't the designed way to do this, how can I aggregate all of our application metrics to the primary region to be computed and sorted?

Thank you for your time.

  • Bit of a clarification: are you looking for something that you can run on your edge node(s) in order to publish messages back to a central Kafka cluster, or are you looking for something that acts in a more central manner, reaching out to each edge node, asking for an update, and then publishing those updates to the Kafka cluster? – JDP10101 Nov 10 '16 at 20:51

2 Answers


As far as I know, here are your options:

  1. Set up a local Kafka cluster in each region and have your edge nodes write to their local Kafka cluster for low-latency writes. From there, set up MirrorMaker to pull data from each local Kafka cluster into your remote Kafka cluster for aggregation.
  2. If you're concerned about interrupting your application's request path with high-latency blocking requests, you may want to configure your producers to write asynchronously (non-blocking) to your remote Kafka cluster; a minimal sketch follows this list. Depending on your programming language of choice, this could be a simple or a complex exercise.
  3. Run a per-host relay (or data buffer) service, which could be as simple as a log file plus a daemon that pushes to your remote Kafka cluster (as mentioned above). Alternatively, run a single-instance Kafka / ZooKeeper container (there are Docker images that bundle both together) that buffers the data for downstream pulling.
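
For option 2, here is a minimal sketch of a non-blocking producer, assuming the Python kafka-python client (any client with an asynchronous send works similarly); the broker addresses and the edge-traffic-stats topic are illustrative placeholders, not anything from your setup:

```python
# Minimal sketch of option 2 with kafka-python: non-blocking cross-region writes.
# Broker addresses and the topic name are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["us-east-kafka-1:9092", "us-east-kafka-2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks=1,         # acknowledge after the leader writes, not the full ISR
    linger_ms=50,   # batch briefly to amortize the WAN round trip
    retries=5,      # retry transient failures instead of surfacing them inline
)

def record_edge_stat(stat: dict) -> None:
    # send() is non-blocking: it appends to an in-memory buffer and returns a
    # Future, so the request path never waits on the cross-region write.
    future = producer.send("edge-traffic-stats", value=stat)
    # Attach an error callback rather than blocking on future.get().
    future.add_errback(lambda exc: print(f"failed to publish stat: {exc}"))
```

Note that send() only buffers the record in memory, so anything not yet flushed is lost if the process dies; that durability gap is exactly what options 1 and 3 address by buffering on a local cluster or local disk first.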

Option 1 is definitely the most standard solution to this problem, albeit a bit heavy-handed. I suspect there will be more tooling coming out of the Confluent / Kafka folks to support option 3 in the future.

– quickinsights

Write the messages to a local logfile on disk. Write a small daemon which reads the logfile and pushes the events to the main Kafka cluster; a sketch of such a daemon is below.

To increase throughput and limit the effect of latency, you could also rotate the logfile every minute, then rsync the rotated file to your main Kafka region with a cron job that runs every minute, and let the import daemon run there.
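
A minimal sketch of such a relay daemon, assuming the Python kafka-python client; the log path, topic, and broker address are placeholders:

```python
# Minimal relay-daemon sketch: tail a local logfile and ship each line to Kafka.
# Log path, topic, and broker address are placeholders.
import time
from kafka import KafkaProducer

LOGFILE = "/var/log/edge-stats.log"   # file your application appends to
producer = KafkaProducer(bootstrap_servers=["us-east-kafka-1:9092"])

def follow(path):
    """Yield new lines appended to the file, like `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)                  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)       # wait for the application to write more
                continue
            yield line.rstrip("\n")

for event in follow(LOGFILE):
    # send() is asynchronous; the daemon keeps tailing while batches are
    # shipped to the primary region in the background.
    producer.send("edge-traffic-stats", event.encode("utf-8"))
```

This sketch does not handle log rotation or remember its offset across restarts, so a production version would checkpoint its position (or read the rotated, rsynced files as described above).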

– edlerd