
I am using Spark Structured Streaming to read files as they arrive in a specific folder on my system.

I want to run a streaming aggregation query on the data and write the result to Parquet files every batch, using Append mode. This way, Spark Structured Streaming would perform a partial aggregation within each batch and write it to disk, and we would read the output Parquet files through an Impala table that points to the output directory. So I need to have something like this:

batch        aggregated_value
batch-1          10
batch-2           8
batch-3          17
batch-4          13

I actually don't need the batch column but it helps to clarify what I am trying to do.
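
For context, the standard Append-mode aggregation looks roughly like the sketch below (the schema, paths and event-time column are just placeholders), but with the watermark a window's result is only appended once the watermark passes the end of the window, not every batch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("file-stream-agg").getOrCreate()
import spark.implicits._

// Read files as they land in the input folder (file sources need an explicit schema)
val input = spark.readStream
  .schema("id STRING, value LONG, event_time TIMESTAMP")   // placeholder schema
  .parquet("/data/input")                                   // placeholder path

// Windowed aggregation; Append mode requires a watermark so Spark knows
// when a window is final and can be appended to the sink
val aggregated = input
  .withWatermark("event_time", "10 minutes")
  .groupBy(window($"event_time", "5 minutes"))
  .agg(sum($"value").as("aggregated_value"))

aggregated.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/data/output")                  // the Impala table points here
  .option("checkpointLocation", "/data/checkpoint")
  .start()
  .awaitTermination()
```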

Does Structured Streaming offer a way to achieve this?

  • Have you read [the guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)? Did you try `groupBy` in Append output mode? – Jacek Laskowski Jan 28 '19 at 17:56
  • Hello Jacek, thanks for your reply. I have read the guide and tried different things so far. I cannot skip the watermark mechanism so that the aggregations are written to disk from the very first batch. Is that possible? – messenjah00 Jan 29 '19 at 18:16
  • I think it is not possible. – Jacek Laskowski Jan 29 '19 at 21:13
  • Hello again, I found a piece of code that does what I need, but I'm not sure how it will perform and whether I can add whatever aggregation logic I need. The code uses a `KeyValueGroupedDataset` and `flatMapGroups` (expanded into a runnable sketch below these comments), like this: `val groupedDS = ds.groupByKey(_.id).flatMapGroups { case (id, iter) => Iterator((id, iter.length)) }.toDF("id", "count").writeStream.format("console").outputMode("append").start().awaitTermination()` Then just change the sink from console to file. What's your view on this? – messenjah00 Jan 30 '19 at 09:30
  • Looks very promising. I must admit I've never used `flatMapGroups` before, but I think I should change that. – Jacek Laskowski Jan 30 '19 at 10:24
  • Ok, I'll update here if I succeed – messenjah00 Jan 30 '19 at 13:06
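
An expanded, self-contained version of the `flatMapGroups` snippet from the comments might look roughly like this (the case class, schema and paths are placeholders, and the console sink from the comment is swapped for the file sink):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder record type for the incoming files
case class Event(id: String, value: Long)

object PerKeyCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("per-key-count").getOrCreate()
    import spark.implicits._

    val ds = spark.readStream
      .schema("id STRING, value LONG")        // placeholder schema
      .parquet("/data/input")                 // placeholder input path
      .as[Event]

    // Group by key and count the rows seen for each key,
    // without a watermark or a built-in streaming aggregation
    val counted = ds
      .groupByKey(_.id)
      .flatMapGroups { case (id, iter) => Iterator((id, iter.length)) }
      .toDF("id", "count")

    counted.writeStream
      .format("parquet")                                 // file sink instead of console
      .option("path", "/data/output")                    // placeholder output path
      .option("checkpointLocation", "/data/checkpoint")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```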

0 Answers