I have started exploring Spark Structured Streaming to write some applications, having used DStreams before this.
I am trying to understand the limitations of Structured Streaming now that I have started using it, and would like to know its drawbacks, if any.
Q1. Each sink in a Structured Streaming app reads independently from its source (e.g. Kafka). That means if you read from one topic A and write to 3 places (e.g. ES, Kafka, S3), it actually sets up 3 source connections, independent of each other.
Will this degrade performance, since 3 independent connections must be managed instead of one (the DStream approach)?
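Here is a minimal sketch of the fan-out I mean (the topic names, paths, checkpoint locations, and the Elasticsearch connector's "es" format are placeholders, not from my actual app):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("fan-out-example").getOrCreate()

// one logical source...
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicA")
  .load()

// ...but each writeStream starts a separate StreamingQuery, and every
// query plans its own Kafka consumer, so topicA is read three times
val toKafka = input.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "topicB")
  .option("checkpointLocation", "/tmp/chk-kafka")
  .start()

val toS3 = input.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/out")
  .option("checkpointLocation", "/tmp/chk-s3")
  .start()

val toEs = input.writeStream
  .format("es") // elasticsearch-hadoop connector
  .option("checkpointLocation", "/tmp/chk-es")
  .start("my-index/doc")

spark.streams.awaitAnyTermination()
```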
Q2. I know that joining 2 streaming datasets is unsupported. How can I perform calculations on 2 streams?
If I have data from topic A and data from topic B, is it possible to do calculations on both of these somehow?
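For example, one workaround I can think of is unioning the two streams (tagging each row with its origin) and aggregating over the union, since union of two streaming DataFrames is supported. Is something like this the intended approach? A sketch (topic names and the key-based grouping are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("two-topics").getOrCreate()

def readTopic(topic: String) = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", topic)
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// tag each stream so rows remain distinguishable after the union
val streamA = readTopic("topicA").withColumn("source", lit("A"))
val streamB = readTopic("topicB").withColumn("source", lit("B"))

// a single streaming aggregation can then compute over rows
// coming from both topics
val combined = streamA.union(streamB)
  .groupBy(col("key"))
  .count()
```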
Q3. In the Spark Streaming UI, there is a Streaming tab with metrics for viewing the throughput of the application. In Structured Streaming this is no longer available.
Why is this? Is the intention to obtain all metrics programmatically and push them to a separate monitoring service?
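For instance, is the expectation that we attach a StreamingQueryListener and forward its progress events to our own monitoring? A sketch (the println stands in for a real metrics push):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder.appName("metrics-example").getOrCreate()

// receives the same kind of numbers the old Streaming tab showed:
// input rate, processing rate, trigger durations
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // p.json carries the full progress report; here we just print two
    // fields, but this is where a push to a monitoring service would go
    println(s"in=${p.inputRowsPerSecond} rows/s, out=${p.processedRowsPerSecond} rows/s")
  }
})
```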