Could anyone compare Flink and Spark as platforms for machine learning? Which is potentially better for iterative algorithms? For the general Flink vs. Spark discussion, see: What is the difference between Apache Spark and Apache Flink?
- Flink is a relatively young project, and it's hard to compare this promising new framework with such a giant project as Spark. – Nikita Apr 21 '15 at 19:12
- I won't answer this question now because we will take a deeper look at both ML frameworks in the near future. For now I totally agree with @ipoteka. – Matthias Kricke Apr 23 '15 at 08:56
- You should check out Flink's recently created Machine Learning Library: http://ci.apache.org/projects/flink/flink-docs-master/libs/ml/. As you can see here, we've planned to do much more: http://goo.gl/h9Qmt3 – Robert Metzger Apr 23 '15 at 10:02
2 Answers
Disclaimer: I'm a PMC member of Apache Flink. My answer focuses on the differences between how Flink and Spark execute iterations.
Apache Spark executes iterations by loop unrolling. This means that for each iteration a new set of tasks/operators is scheduled and executed. Spark does this very efficiently because it is very good at low-latency task scheduling (the same mechanism is used for Spark Streaming, by the way) and it caches data in memory across iterations. Therefore, each iteration operates on the result of the previous iteration, which is held in memory. In Spark, iterations are implemented as regular for-loops in the driver program (see the Logistic Regression example).
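As a minimal, hypothetical sketch (Scala) of what such a driver-side loop looks like in Spark: the gradient update below is a dummy placeholder rather than Spark's actual Logistic Regression example, and `cache()` is what keeps the input in memory so every pass reuses it instead of recomputing it.

```scala
import org.apache.spark.sql.SparkSession

object SparkIterationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("iteration-sketch")
      .master("local[*]") // assumption: run locally for the sketch
      .getOrCreate()
    val sc = spark.sparkContext

    // Cache the input so each pass reads it from memory.
    val points = sc.parallelize(Seq(1.0, 2.0, 3.0)).cache()

    var w = 0.0
    for (i <- 1 to 10) {
      // Every iteration of this driver-side loop schedules a new set of tasks.
      val gradient = points.map(x => x * w - x).sum() // dummy update, not real LR
      w -= 0.1 * gradient
    }

    println(s"final w = $w")
    spark.stop()
  }
}
```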
Flink executes programs with iterations as cyclic data flows. This means that a data flow program (and all of its operators) is scheduled just once, and the data is fed back from the tail of an iteration to its head. Basically, data flows in cycles around the operators within an iteration. Since operators are scheduled only once, they can maintain state across all iterations. Flink's API offers two dedicated operators for specifying iterations: 1) bulk iterations, which are conceptually similar to loop unrolling, and 2) delta iterations. Delta iterations can significantly speed up certain algorithms because the work per iteration shrinks as the iterations progress. For example, the 10th iteration of a delta-iteration PageRank implementation completes much faster than the first.
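For illustration, here is a minimal Scala sketch of Flink's bulk iteration operator on the DataSet API. The update step is a dummy placeholder; a real delta iteration would instead use `iterateDelta` with a solution set and a workset so that only the changing part of the data is processed each round.

```scala
import org.apache.flink.api.scala._

object FlinkIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // The data flow is scheduled once; the result of each pass is fed
    // back from the tail of the iteration to its head.
    val initial: DataSet[Double] = env.fromElements(0.0)

    val result = initial.iterate(10) { current =>
      current.map(w => w + 0.1) // dummy update step, executed 10 times
    }

    result.print() // triggers execution and prints the final value
  }
}
```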

- Thank you for the explanation! Do I understand correctly that Flink can preserve state and operators on worker nodes between iterations? Does this mean potentially less overhead per iteration compared to Spark, which schedules new tasks for each iteration? – Alexander Apr 23 '15 at 21:13
- Yes, Flink keeps the operators on the workers running (so you can easily keep state between iterations) and thereby saves the time of redeploying the tasks for each iteration. In particular with the delta iteration feature Fabian mentioned, iterations (on small parts of the data) that run for only a few seconds are possible. – Robert Metzger Apr 23 '15 at 21:26
- Sounds good! How large is Flink's fixed overhead per iteration? On the order of 0.1 seconds? Better? Assume the algorithm does nothing and just iterates. – Alexander Apr 23 '15 at 23:03
- Good question. I'm not aware of a benchmark that tried to measure exactly that. It should depend on the scale-out (number of parallel tasks) and the amount of data that runs through the iterations. – Fabian Hueske Apr 23 '15 at 23:20
From my experience with ML and data stream processing: Flink and Spark are good at different things, and they can complement each other in ML scenarios. Flink is well suited to online learning tasks, where we keep updating a partial model by consuming new events while also doing inference in real time. The partial model can also be merged with a pre-trained model that Spark built offline on the historical data.
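As a rough illustration of that pattern (not taken from the answer), here is a hypothetical Flink DataStream sketch in which a long-running operator keeps a partial model as plain local state and updates it per event. Initializing `weight` from an offline, e.g. Spark-trained, model and making the state fault-tolerant (keyed/managed state) are deliberately left out.

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.streaming.api.scala._

object OnlineModelSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in for a real event source (Kafka, socket, etc.).
    val events: DataStream[Double] = env.fromElements(1.0, 2.0, 3.0)

    val predictions = events.map(new RichMapFunction[Double, Double] {
      // Partial model held as plain operator-local state; in a real job this
      // would be loaded from an offline-trained model and checkpointed.
      private var weight = 0.5

      override def map(x: Double): Double = {
        val prediction = weight * x
        weight += 0.01 * (x - prediction) // dummy online update rule
        prediction
      }
    })

    predictions.print()
    env.execute("online-model-sketch")
  }
}
```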
