I am working with Flume to ingest a large amount of data into HDFS (on the order of petabytes). I would like to know how Flume makes use of its distributed architecture. I have over 200 servers, and I have installed Flume on only one of them, the machine the data originates from (i.e. the data source); the sink is HDFS. (Hadoop runs on these servers via Serengeti.) I am not sure whether Flume distributes itself across the cluster on its own, or whether I have installed it incorrectly. I followed Apache's user guide for the Flume installation and this SO post:
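For reference, here is a minimal sketch of the single-agent configuration I am running. The agent name, source type, directories, and NameNode address below are illustrative, not my exact values:

    # Single Flume agent: one source, one memory channel, one HDFS sink
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # Source: watch a local spooling directory for new files
    # (illustrative; my real data source may be different)
    agent1.sources.src1.type     = spooldir
    agent1.sources.src1.spoolDir = /var/flume/incoming
    agent1.sources.src1.channels = ch1

    # Channel: in-memory buffer between source and sink
    agent1.channels.ch1.type                = memory
    agent1.channels.ch1.capacity            = 10000
    agent1.channels.ch1.transactionCapacity = 1000

    # Sink: write events into HDFS, bucketed by date
    agent1.sinks.sink1.type                   = hdfs
    agent1.sinks.sink1.hdfs.path              = hdfs://namenode:8020/flume/events/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType          = DataStream
    agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
    agent1.sinks.sink1.channel                = ch1

I start the agent with:

    flume-ng agent --conf conf --conf-file flume.conf --name agent1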
How to install and configure apache flume?
http://flume.apache.org/FlumeUserGuide.html#setup
I am new to Flume and trying to understand it better. Any help would be greatly appreciated. Thanks!