I built an application that searches for similar images stored in a distributed environment using Hadoop. However, Hadoop does not support real-time processing, which is why the response time is long. I know that Storm is another framework for big data analysis applications, but I am not sure whether Storm can be used to implement this kind of application.

Can anybody advise what kinds of applications make efficient use of the Storm framework?

Ravindra babu
ndk076

1 Answer

Storm is a highly scalable, fast, fault-tolerant open source system for distributed computation, with a special focus on stream processing. Storm excels at event processing and incremental computation, calculating rolling metrics in real time over streams of data.

Event stream processing is a major strength of Storm.
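To make "rolling metrics over a stream" more concrete, here is a minimal plain-Python sketch of the idea. It does not use the Storm API; the `RollingCounter` class and its parameters are hypothetical names chosen for illustration, and it assumes events arrive with non-decreasing timestamps.

```python
from collections import deque


class RollingCounter:
    """Counts the events seen during the last `window_seconds` seconds.

    Illustrative only: a hypothetical helper, not part of Storm.
    Assumes timestamps are passed in non-decreasing order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        # Record the new event.
        self.timestamps.append(timestamp)
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        # The metric is updated incrementally, one event at a time.
        return len(self.timestamps)
```

Each call to `add` updates the metric as an event arrives, instead of recomputing it over the whole data set as a batch job would. In a stream processor, logic of this kind would typically run inside a long-lived processing component.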

Generally, Hadoop is used for batch processing, whereas Storm is often described as "the Hadoop of real-time processing", and Spark offers general-purpose distributed processing with an in-memory data store.

Have a look at these Storm, Spark, and stack comparison links.

EDIT:

My solution for this problem:

1) Store the images in a CMS (content management system) with a CDN spread across multiple networks, not in HDFS or a NoSQL database.

2) Store the image ID, image name, MD5SUM, and image location metadata in an HBase table.

3) Use Spark and HBase for image data processing, e.g. removing duplicate images by checking the MD5SUM.
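The MD5SUM check in step 3 can be sketched in plain Python as follows. This is only an illustration of the idea, not Spark or HBase code: the record layout (`image_id`, `md5sum`) and the function names are assumptions made for the example.

```python
import hashlib


def md5sum(data: bytes) -> str:
    """Return the hex MD5 digest of an image's raw bytes."""
    return hashlib.md5(data).hexdigest()


def drop_duplicates(records):
    """Keep the first record seen for each MD5SUM.

    `records` is an iterable of dicts with a hypothetical "md5sum" key,
    standing in for rows of the metadata table.
    """
    seen = set()
    unique = []
    for record in records:
        if record["md5sum"] not in seen:
            seen.add(record["md5sum"])
            unique.append(record)
    return unique


# Usage: two byte-identical images collapse into one record.
records = [
    {"image_id": 1, "md5sum": md5sum(b"same bytes")},
    {"image_id": 2, "md5sum": md5sum(b"same bytes")},
    {"image_id": 3, "md5sum": md5sum(b"other bytes")},
]
unique_records = drop_duplicates(records)
```

Note that an MD5 check only catches byte-for-byte identical files. Two images that look alike but differ in encoding, size, or a single pixel will have different checksums, so this step removes exact duplicates and does not by itself find similar images.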

Ravindra babu
  • Thanks for your answer and the great links! I am still not clear on whether Storm is a suitable solution for similar-image search when the images are stored across distributed machines. – ndk076 Oct 07 '15 at 12:00
  • I too. I prefer Spark to Storm. – Ravindra babu Oct 07 '15 at 12:01
  • **Disclaimer: I am a committer at Apache Flink** You might also consider https://flink.apache.org/ In contrast to Spark, it provides true streaming similar to Storm (and no micro-batching as Spark does) while Flink can also handle batch jobs. Compare: https://stackoverflow.com/questions/28082581/what-is-the-differences-between-apache-spark-and-apache-flink and https://stackoverflow.com/questions/30699119/what-is-are-the-main-differences-between-flink-and-storm – Matthias J. Sax Oct 07 '15 at 12:16
  • @Matthias J. Sax and Ravindra: thank you both for your answers. What confuses me is this: my current search implementation is batch processing based on Hadoop, so can I use Flink or Storm as a replacement to move this problem to stream processing? – ndk076 Oct 08 '15 at 07:40
  • That depends on your and your organization's comfort level, the urgency of delivery, sponsorship for moving to a different technology, etc. If none of that is possible, you will continue with the status quo (delayed processing with Hadoop batch processing). – Ravindra babu Oct 08 '15 at 07:48
  • Yes, I have to change to another technology to reach real-time or near-real-time performance for this task. Basically, when a user queries with an image, my system runs and returns the similar images. During this step, a lot of computation is executed for the comparison, along with I/O to access the image collection stored in HDFS. – ndk076 Oct 08 '15 at 08:02
  • What confuses me is whether the above data flow can become streaming data. My understanding is that Hadoop takes a large amount of data at one time and iterates over it until processing completes, whereas streaming in Storm (or Spark, etc.) means the data stream is processed continuously. But in this case the system runs only when a user interacts with it. If I use Storm, Spark, or some other technology that supports stream processing, I still have to load the data, do some internal work, and return the result. So intuitively, it seems it will not improve performance? – ndk076 Oct 08 '15 at 08:02
  • I still think Spark + HBase (due to the in-memory data store) will be fast enough to meet your requirements. – Ravindra babu Oct 08 '15 at 08:06