
We have discussed similar questions before.

But Spark Structured Streaming was added in Spark 2.2; it brings a lot of changes to streaming, and it is outstanding.

Can we say Spark Structured Streaming is stream processing, or is it still batch processing?

Now what is the big difference between Apache Flink and Apache Spark Structured Streaming?

javamonkey79
ShuMing Li

1 Answer


Currently:

Spark Structured Streaming still uses micro-batches in the background. However, it supports event-time processing and quite low latency (though not as low as Flink's), and it supports SQL and type-safe queries on streams in one API; there is no distinction, since every Dataset can be queried both with SQL and with type-safe operators. It has end-to-end exactly-once semantics (at least they say so ;) ). The throughput is better than Flink's (there were some benchmarks with different results, but look at the Databricks post about the results).
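
The micro-batch model mentioned above can be sketched in a few lines of plain Python (a toy model for illustration, not the Spark engine; the event list and `batch_interval` are made up): the engine collects whatever arrived during each interval and processes it as one small batch, which is why end-to-end latency cannot drop below the batch interval.

```python
# Conceptual sketch of micro-batch execution (not Spark's actual engine).
def micro_batch_run(events, batch_interval):
    """Group (arrival_time, value) events into batches of `batch_interval` seconds."""
    batches = []
    current, window_end = [], batch_interval
    for t, value in sorted(events):
        while t >= window_end:  # close finished windows before placing this event
            batches.append(current)
            current, window_end = [], window_end + batch_interval
        current.append(value)
    batches.append(current)  # flush the last, possibly partial, batch
    return batches

# Arrival times in seconds; a 1-second micro-batch interval.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
print(micro_batch_run(events, 1.0))  # [['a', 'b'], ['c'], ['d']]
```

Records "a" and "b" land in the same batch even though one arrived 0.3 s later, while "c" waits almost a full interval before any processing happens.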

In near future:

Spark Continuous Processing Mode is in progress, and it will give Spark ~1 ms latency, comparable to Flink's. However, as I said, it's still in progress. The API is ready for non-batch jobs, so it's easier to use than the previous Spark Streaming.

The main difference:

Spark relies on micro-batching for now, while Flink has pre-scheduled operators. That means Flink's latency is lower, but the Spark community is working on Continuous Processing Mode, which will work similarly (as far as I understand) to receivers.
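
The latency consequence of that difference can be sketched in plain Python (a toy model of neither engine; the 10 ms per-record processing cost is a made-up number): under micro-batching a record first waits for its batch to close, while a pre-scheduled, long-running operator handles it as soon as it arrives.

```python
# Toy latency comparison: micro-batch vs. a long-running (pre-scheduled) operator.
def micro_batch_latency(arrival, batch_interval, processing=0.01):
    """A record waits until its batch closes, then pays the processing cost."""
    batch_close = ((arrival // batch_interval) + 1) * batch_interval
    return (batch_close - arrival) + processing

def continuous_latency(arrival, processing=0.01):
    """A long-running operator processes each record immediately on arrival."""
    return processing

# A record arriving 0.2 s into a 1-second batch interval:
print(micro_batch_latency(0.2, 1.0))  # ~0.81 s, dominated by waiting for the batch
print(continuous_latency(0.2))        # 0.01 s, only the processing cost
```

The micro-batch figure is bounded below by the time left until the batch closes, which is why shrinking the batch interval (or removing it entirely, as Continuous Processing Mode aims to) is the path to Flink-like latency.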

T. Gawęda
  • I would not say it supports "full" event-time, as the only watermark you can provide is one with a static lag. – Dawid Wysakowicz Sep 01 '17 at 11:21
  • 3
    Also the claim that the throughput is better is not true. See e.g. this slide: https://www.slideshare.net/JamieGrier/extending-the-yahoo-streaming-benchmark-mapr-benchmarks/36 Flink also can achieve throughputs > 70M msgs/sec. In the post you provided they did not explain any of their setup so I would not believe in any of those number. – Dawid Wysakowicz Sep 01 '17 at 11:42
  • Databricks provided info on which benchmark they tested. So it's fully reproducible. They even provided info on which instances they used. So I would believe them, not a benchmark run on custom nodes without any chance to replicate it – T. Gawęda Sep 01 '17 at 11:45
  • 1
    They used AWS, you can simply rerumn those Tests. Flink's benchmark is done on custom environment, without chanse to reproduce it. Sorry, I can't trust them if they make benchmark that is not reproducible - reproducibility should be the first point of the benchmark – T. Gawęda Sep 01 '17 at 11:46
  • AWS has a ~10 Gbit network, so we should use Flink's 15M result, not their best one, if we want to compare those results – T. Gawęda Sep 01 '17 at 11:49
  • Downvote without a comment: why? I told you why Flink's benchmarks are not reproducible and what they did to get better performance. They used a better network; on a similar one they are slower – T. Gawęda Sep 18 '17 at 21:48
  • Sorry, I forgot to submit my reasoning, and to be honest, upon rereading the answer and comments I cannot recall what I thought was missing. So I will retract my downvote (when I can edit my vote in a few hours). Apologies. – Jicaar Sep 19 '17 at 16:34
  • @Jicaar No problem :) And thank you for the upvote. I really want to write only true and reproducible data, as my academic background requires :) One sentence was already changed, because it was not very precise, but now I think I have written all I can write without implying that someone was cheating. If you have some info that will help, please provide it and I will use it :) But as I wrote, I require only reproducible and stable info :) – T. Gawęda Sep 19 '17 at 16:39
  • 1
    The only think I can think to add is its worth noting that the tests were run on clusters of different processor types. The Spark job is ran on "10 r3.xlarge machines" which according to the aws instance types, that is 10 "Intel Xeon E5-2670 v2 (Ivy Bridge) Processors". The flink job is ran on 10 "Xeon E3-1230-V2" processors (don't see on AWS instance types). According to Intel, Xeon E5-2670 v2 is 10 cores (and a lot more expensive). So the cluster differences are extremely different, and Spark said in the video they just took the benchmark Flink provided which uses a less powerful cluster. – Jicaar Sep 19 '17 at 18:05
  • 1
    Also, it sounds like the Flink job had to use processing power to generate the data and then consume it, unlike the spark job. So that could be a factor as well. And I am not an expert on what in a Processor is more relevant to affect these tests, but the differences are enough to make the benchmarking comparison between the technologies misleading at best. Here are the links I found to the two processor types: http://ark.intel.com/products/65732/Intel-Xeon-Processor-E3-1230-v2-8M-Cache-3_30-GHz https://ark.intel.com/products/75275/Intel-Xeon-Processor-E5-2670-v2-25M-Cache-2_50-GHz – Jicaar Sep 19 '17 at 18:13
  • @Jicaar - thanks. I will edit my answer then, but probably tomorrow ;) Yeah, I didn't spot this difference – T. Gawęda Sep 19 '17 at 18:15
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/154824/discussion-between-t-gaweda-and-jicaar). – T. Gawęda Sep 19 '17 at 20:48
  • Ok, we have a clear message from Databricks: https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html – T. Gawęda Oct 11 '17 at 16:40
  • While Flink offers exactly-once guarantees, Spark's Structured Processing offers only at-least-once fault-tolerance guarantees, as opposed to Spark Streaming, which does offer exactly-once guarantees but does not scale as well and imposes higher latency due to batch processing. – Hermes May 21 '19 at 16:12
  • @Hermes Could you please provide a reference that says SSP uses at-least-once? Because in most of the modes it uses exactly-once, so I'm quite confused about what you mean. – T. Gawęda May 22 '19 at 09:25
  • @T.Gawęda Sorry for being imprecise. You can get exactly-once guarantees instead of the default at-least-once, as long as you are not using Kafka Sink and are OK with higher latency. "Continuous processing is a new, experimental streaming execution mode introduced in Spark 2.3 that enables low (~1 ms) end-to-end latency with at-least-once fault-tolerance guarantees. Compare this with the default micro-batch processing engine which can achieve exactly-once guarantees but achieve latencies of ~100ms at best." https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html – Hermes Jun 03 '19 at 14:10
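
Dawid Wysakowicz's point about the "static lag" watermark can be made concrete with a plain-Python sketch (an illustration of the idea, not Spark's implementation; the event times and 10-second lag are made-up values): the watermark is simply the maximum event time seen so far minus a fixed allowed lateness, so it never adapts to how disordered the stream actually is.

```python
# Sketch of a static-lag watermark: max event time seen so far, minus a fixed delay.
def watermark(event_times, lag):
    """Return the watermark after each event; events older than it are dropped."""
    wm, history = float("-inf"), []
    for t in event_times:
        wm = max(wm, t - lag)  # the lag is a constant; it never adapts to the stream
        history.append(wm)
    return history

# Event times in seconds, with 10 seconds of allowed lateness.
print(watermark([100, 130, 120, 170], 10))  # [90, 120, 120, 160]
```

Note how the out-of-order event at time 120 arrives while the watermark already sits at 120: with a static lag there is no way to hold the watermark back for a bursty or skewed source, which is the limitation the comment refers to.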