Druid is used for both real time and batch processing. But can it totally replace hadoop? If not why? As in what is the advantage of hadoop over druid? I have read that druid is used along with hadoop. So can the use of Hadoop be avoided?
-
I think your question should be rephrased (and you can draw that conclusion from what @nylon-smile wrote). See my answer below. – user766353 Aug 01 '14 at 22:16
2 Answers
We are talking about two slightly related but very different technologies here.
Druid is a real-time analytics system and is a perfect fit for timeseries and time based events aggregation.
Hadoop is HDFS (a distributed file system) + Map Reduce (a paradigm for executing distributed processes), which together have created an eco system for distributed processing and act as underlying/influencing technology for many other open source projects.
You can setup druid to use Hadoop; that is to fire MR jobs to index batch data and to read its indexed data from HDFS (of course it will cache them locally on the local disk)
If you want to ignore Hadoop, you can do your indexing and loading from a local machine as well, of course with the penalty of being limited to one machine.

- 8,990
- 1
- 25
- 34
Can you avoid using Hadoop with Druid? Yes, you can stream data in real-time into a Druid cluster rather than batch-loading it with Hadoop. One way to do this is to stream data into Kafka, which will handle incoming events and pass them into Storm, which can then process and load them into Druid Realtime nodes.
Typically this setup is used with Hadoop in parallel, because streamed real-time data comes with its own baggage and often needs to be fixed up and backfilled. That whole architecture has been dubbed "Lambda" by some.

- 528
- 5
- 23