
I'm new to Flume and am considering using it in the scenario below.

Our system receives events as HTTP POST requests, and we need to store a copy of them in Kafka (for further processing) and another copy in HDFS (as a permanent store).

Can we configure a Flume agent with an HTTP source, a Kafka channel, and an HDFS sink to meet this requirement? Will this solution work?

1 Answer


If I've understood well, you want Kafka as a final backend where the data is stored, not as the internal channel a Flume agent uses to connect its source and sink. A Flume agent is basically composed of a source that receives data and builds Flume events, which are put into a channel so that a sink can read those events and do something with them (typically, persist the data in a final backend). Thus, in your design, if you use Kafka as the internal channel, it will be just that: an internal way of communicating between the HTTP source and the HDFS sink, and it will never be accessible from outside the agent.

In order to meet your needs, you will need an agent such as:

http_source -----> memory_channel -----> HDFS_sink ------> HDFS
            |
            |----> memory_channel -----> Kafka_sink -----> Kafka

{.................Flume agent.....................}       {backend}

Please observe that the memory-based channels are the internal ones; they can be based on memory or files, or even on Kafka, but such a Kafka channel would be different from the final Kafka cluster where you persist the data and which is accessible to your application.
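
As an illustration, a minimal agent configuration along those lines could look like the sketch below. Property names follow the Flume 1.6 documentation; the port, HDFS path, topic name and broker list are placeholders you would replace with your own values:

    agent.sources = http-source
    agent.channels = hdfs-channel kafka-channel
    agent.sinks = hdfs-sink kafka-sink

    # HTTP source; the default replicating selector copies each event to both channels
    agent.sources.http-source.type = http
    agent.sources.http-source.port = 8080
    agent.sources.http-source.channels = hdfs-channel kafka-channel

    # Two independent memory-based internal channels, one per sink
    agent.channels.hdfs-channel.type = memory
    agent.channels.hdfs-channel.capacity = 10000
    agent.channels.kafka-channel.type = memory
    agent.channels.kafka-channel.capacity = 10000

    # HDFS sink for permanent storage
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.channel = hdfs-channel
    agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
    agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

    # Kafka sink publishing to the externally visible Kafka cluster
    agent.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
    agent.sinks.kafka-sink.channel = kafka-channel
    agent.sinks.kafka-sink.topic = events
    agent.sinks.kafka-sink.brokerList = kafka-broker:9092

With this layout the Kafka sink is the only part that talks to your external Kafka cluster, and the two memory channels remain purely internal to the agent.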

frb
  • Thanks for the clarification. I agree with your comments about using two sinks. One thing I didn't understand is 'it will never be accessible from outside the agent.' As we are providing the Kafka cluster ourselves, it would be accessible from outside the agent, right? Please clarify. – Hemanth Abbina Sep 22 '15 at 17:28
  • Ummm, you are right. Since the Kafka channel is based on a cluster of your own, it is perfectly accessible. Nevertheless, that would be a rare use of Flume; I mean, you could avoid the second sink and directly access the data within this internal Kafka-based channel (while, at the same time, the HDFS sink consumes that data for permanent storage). But I would prefer to have two memory-based internal channels and consider Kafka a final backend rather than an internal channel. I'll fix my answer. – frb Sep 23 '15 at 06:50
  • In fact you can very well use Kafka as a channel and have a consumer retrieve messages from the channel's topic. Kafka handles offsets per consumer (depending on the type of client) and only deletes data once the message's retention period is reached. – Erik Schmiegelow Sep 29 '15 at 15:16
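
Following up on the last comment, if you do decide to use Kafka as the channel itself and read from its topic directly, a minimal channel definition could look like the sketch below (again with placeholder hosts and topic names, using Flume 1.6 property names). Note that by default the channel stores messages as Avro-serialized Flume events, so external consumers would need to decode that format; check the parseAsFlumeEvent behaviour for your Flume version if you want raw payloads on the topic:

    agent.channels = kafka-channel
    agent.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
    agent.channels.kafka-channel.brokerList = kafka-broker:9092
    agent.channels.kafka-channel.zookeeperConnect = zk-host:2181
    agent.channels.kafka-channel.topic = flume-events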