
I'm working with two real-time stream processing frameworks, Apache Storm and Apache Flink. I've searched everywhere, but I can't find any big difference between them. In particular, I would like to know how they behave depending on the size of the data, the topology, and so on.

Dinesh Shingadiya
  • This seems to be a duplicate of https://stackoverflow.com/questions/55964790/difference-between-apache-storm-and-flink – Fabian Hueske May 03 '19 at 08:15
  • I don't understand your comment. – Marco Domenicano May 03 '19 at 08:29
  • Isn't your question a duplicate of the one I linked ("What is/are the main difference(s) between Flink and Storm?")? If it is not, it would be good to rephrase the question title to point out the difference. – Fabian Hueske May 03 '19 at 08:37
  • @FabianHueske I guess you mixed the links. This seems to be the duplicate: https://stackoverflow.com/questions/30699119/what-is-are-the-main-differences-between-flink-and-storm/30719138#30719138 – TobiSH May 03 '19 at 09:38
  • Possible duplicate of [What is/are the main difference(s) between Flink and Storm?](https://stackoverflow.com/questions/30699119/what-is-are-the-main-differences-between-flink-and-storm) – TobiSH May 03 '19 at 09:39
  • You are right, @TobiSH. Sorry for that :-( – Fabian Hueske May 03 '19 at 09:48
  • Exactly, @TobiSH. I don't understand Fabian's comment because its link points back to my own post. My question is a little bit different, and you can see the difference in jbx's answer. Moreover, the other question is 3 years older and the answers may be different now. – Marco Domenicano May 03 '19 at 11:49

1 Answer


The difference is mainly in the level of abstraction you get when processing streams of data.

Apache Storm is a bit lower level: it deals with data sources (Spouts) and processors (Bolts) that are connected together to perform transformations and aggregations on individual messages in a reactive way.
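To make the spout/bolt idea more concrete, here is a minimal plain-Java sketch of the model. It does not use the Storm API: `SpoutBoltSketch`, `splitBolt` and `countBolt` are hypothetical names, and the "spout" is just an in-memory list. It only illustrates how each message is pushed through a chain of processors one at a time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Conceptual model only: NOT the Storm API. All names here are hypothetical.
class SpoutBoltSketch {

    // A "bolt" that splits a sentence into words and emits each word downstream.
    static Consumer<String> splitBolt(Consumer<String> downstream) {
        return sentence -> {
            for (String word : sentence.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    downstream.accept(word);
                }
            }
        };
    }

    // A "bolt" that keeps a running count per word (stateful processing).
    static Consumer<String> countBolt(Map<String, Long> counts) {
        return word -> counts.merge(word, 1L, Long::sum);
    }

    // Wires source -> split -> count and pushes every message through individually.
    static Map<String, Long> run(List<String> sentences) {
        Map<String, Long> counts = new HashMap<>();
        Consumer<String> pipeline = splitBolt(countBolt(counts));
        for (String sentence : sentences) { // the "spout": emits one message at a time
            pipeline.accept(sentence);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not to be", "to do")));
    }
}
```

In a real Storm topology the spouts and bolts run distributed across worker processes; this sketch only mirrors the per-message flow inside a single thread.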

There is a Trident API that abstracts a little from this low-level, message-driven view into more aggregated, query-like constructs, which makes things a bit easier to integrate together. (There is also an SQL-like interface for querying data streams, but it is still marked as experimental.)

From the documentation:

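    // Word count over a stream of sentences with the Trident API
    TridentState wordCounts =
        topology.newStream("spout1", spout)
            // split each sentence into individual words
            .each(new Fields("sentence"), new Split(), new Fields("word"))
            // group identical words together
            .groupBy(new Fields("word"))
            // keep a running count per word in an in-memory state
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
            .parallelismHint(6);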

Apache Flink has a more functional-style interface for processing events. If you are used to the Java 8 style of stream processing (or to other functional-style languages like Scala or Kotlin), this will look very familiar. It also has a nice web-based monitoring tool. The nice thing about Flink is that it has built-in constructs for aggregating by time windows and similar. (In Storm you can probably do this too with Trident.)
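For comparison, this is roughly what the "Java 8 style" mentioned above looks like with plain `java.util.stream` on a finite list. It is not Flink code, and `StreamStyleWordCount` is a hypothetical class name used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain java.util.stream, NOT Flink: shows the functional style on a finite list.
class StreamStyleWordCount {

    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // one line -> many words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                // group identical words and count each group
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("a b a", "b c")));
    }
}
```

The main conceptual difference is that a Java stream is finite, so it can be collected into a final result, whereas a Flink `DataStream` is unbounded, which is why aggregations there are expressed over windows.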

From the documentation:

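    // Word count over a sliding time window with the DataStream API
    DataStream<WordWithCount> windowCounts = text
            // split each incoming line into (word, 1) records
            .flatMap(new FlatMapFunction<String, WordWithCount>() {
                @Override
                public void flatMap(String value, Collector<WordWithCount> out) {
                    for (String word : value.split("\\s")) {
                        out.collect(new WordWithCount(word, 1L));
                    }
                }
            })
            // partition the stream by word
            .keyBy("word")
            // sliding window: 5 seconds long, evaluated every second
            .timeWindow(Time.seconds(5), Time.seconds(1))
            // sum the counts of records with the same word within a window
            .reduce(new ReduceFunction<WordWithCount>() {
                @Override
                public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                    return new WordWithCount(a.word, a.count + b.count);
                }
            });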
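The `timeWindow(Time.seconds(5), Time.seconds(1))` call defines a sliding window, so every event contributes to several overlapping windows. The following plain-Java sketch shows which windows a single timestamp falls into. It is a simplified model, not Flink's implementation: `SlidingWindowSketch` and `windowStarts` are hypothetical names, and it assumes non-negative timestamps with windows aligned to multiples of the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of sliding-window assignment; not Flink's actual implementation.
class SlidingWindowSketch {

    // Returns the start times (ms) of all windows [start, start + sizeMs)
    // that contain timestampMs. Assumes timestampMs >= 0 and windows aligned
    // to multiples of slideMs.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // start of the most recent window that has already begun
        long lastStart = timestampMs - (timestampMs % slideMs);
        // walk backwards while the window still covers the timestamp
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 5 s window sliding every 1 s, event at t = 7.3 s
        System.out.println(windowStarts(7300, 5000, 1000));
    }
}
```

With a 5 second window and a 1 second slide, each event ends up in five windows, so every word is counted in five successive results before it drops out.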

When I was evaluating the two, I went with Flink, simply because at that time it felt better documented and I got started with it much more easily. Storm was slightly more obscure. There is a course on Udacity which helped me understand it much better, but in the end Flink still felt like a better fit for my needs.

You might also want to look at this answer here. It is a bit old, though, so both projects have probably evolved since then.

jbx
  • Thank you for your answer. I'm doing my master's thesis, and I have to analyze these two frameworks and compare an application running on both of them, so I was looking for one that can be compared. Can you also explain how they differ with respect to cluster size? Do you know a real-time stream that I can use? Thanks – Marco Domenicano May 03 '19 at 07:32
  • By size of cluster, do you mean how many nodes you can have? I'm not sure about the scalability properties; I guess the best thing would be to build the same application on both and benchmark them on the same hardware. You could look at things like Twitter feeds filtered on a number of keywords through their API, or aggregating stock price changes, or the same thing for crypto-currency prices (the APIs of these exchanges tend to be quite good since they are used by bots). You could compute all the moving averages and indicators as a showcase of real-time data aggregation. – jbx May 03 '19 at 07:38
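As a rough idea of the kind of aggregation suggested in the last comment, here is a minimal sketch of a simple moving average over the most recent prices, in plain Java. `MovingAverage` is a hypothetical helper, independent of both Storm and Flink; in a benchmark, the same logic would be placed inside a bolt or a window function.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simple moving average over the last `size` values; a hypothetical helper,
// independent of Storm and Flink.
class MovingAverage {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    MovingAverage(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    // Adds a new price and returns the average of the most recent `size` prices
    // (or of all prices seen so far, if fewer than `size` have arrived).
    double add(double price) {
        window.addLast(price);
        sum += price;
        if (window.size() > size) {
            sum -= window.removeFirst(); // evict the oldest price
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        MovingAverage avg = new MovingAverage(3);
        for (double price : new double[] {10, 20, 30, 40}) {
            System.out.println(avg.add(price));
        }
    }
}
```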