In my use case, I need to perform multiple aggregations in Spark Structured Streaming. This is not directly supported as of 2.4.x, but I have seen this thread (Multiple aggregations in Spark Structured Streaming).
As far as I understand, there are two options to achieve this:
The first option: perform the first aggregation, store its result in some temporary store using either "foreach" or "foreachBatch", and then read it back to perform the second aggregation. This step involves writing to external storage and may not be very efficient.
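A minimal sketch of what I mean by the first option, using foreachBatch to persist each micro-batch of the first aggregation (the Kafka source, topic, and output path are placeholders for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("multi-agg").getOrCreate()
import spark.implicits._

// Source stream (connection details are hypothetical)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS key")

// First aggregation
val firstAgg = events.groupBy($"key").count()

// Persist each micro-batch of the first aggregation to an intermediate
// sink; a second, separate query would then read this path and perform
// the second aggregation.
firstAgg.writeStream
  .outputMode("update")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write.mode("append").parquet("/tmp/first_agg") // hypothetical path
  }
  .start()
```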
The second option, as mentioned in the thread (Multiple aggregations in Spark Structured Streaming), is to use "flatMapGroupsWithState". This looks promising, but I am not sure about its performance implications, as this method may involve shuffling (and I am not sure whether the shuffle can be optimized here).
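For the second option, this is roughly how I picture folding both aggregation steps into a single stateful pass (the case classes and the logic of combining a sum and a count into an average are my own illustration, assuming `events` is an existing streaming `Dataset[Event]`):

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, value: Long)
case class AggState(sum: Long, count: Long)
case class Result(key: String, avg: Double)

// Update the per-key running state with each micro-batch's events and
// emit the derived second-level aggregate (here, an average).
def updateState(
    key: String,
    events: Iterator[Event],
    state: GroupState[AggState]): Iterator[Result] = {
  val old = state.getOption.getOrElse(AggState(0L, 0L))
  val updated = events.foldLeft(old) { (s, e) =>
    AggState(s.sum + e.value, s.count + 1)
  }
  state.update(updated)
  Iterator(Result(key, updated.sum.toDouble / updated.count))
}

val results = events
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateState)
```

My concern is that the `groupByKey` here still triggers a shuffle, and I do not know whether that can be avoided.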
Which of these two options is the better way to achieve multiple aggregations in Spark Structured Streaming, especially in terms of performance?