
I am using Spark Structured Streaming to read files as they arrive in a specific folder on my system.

I want to run a streaming aggregation query on the data and write the result to Parquet files every batch, using Append mode. This way, Spark Structured Streaming would perform a partial aggregation within each batch and write it to disk, and we would read the output Parquet files through an Impala table that points to the output directory. So I need to have something like this:

batch        aggregated_value
batch-1          10
batch-2           8
batch-3          17
batch-4          13

I actually don't need the batch column but it helps to clarify what I am trying to do.
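
For context, the standard Append-mode aggregation looks roughly like the sketch below (the schema, paths and event-time column are just placeholders), but with the watermark a window's result is only appended once the watermark passes the end of the window, not every batch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("file-stream-agg").getOrCreate()
import spark.implicits._

// Read files as they land in the input folder (file sources need an explicit schema)
val input = spark.readStream
  .schema("id STRING, value LONG, event_time TIMESTAMP")   // placeholder schema
  .parquet("/data/input")                                   // placeholder path

// Windowed aggregation; Append mode requires a watermark so Spark knows
// when a window is final and can be appended to the sink
val aggregated = input
  .withWatermark("event_time", "10 minutes")
  .groupBy(window($"event_time", "5 minutes"))
  .agg(sum($"value").as("aggregated_value"))

aggregated.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/data/output")                  // the Impala table points here
  .option("checkpointLocation", "/data/checkpoint")
  .start()
  .awaitTermination()
```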

Does Structured Streaming offer a way to achieve this?

  • Have you read [the guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)? Did you try `groupBy` in Append output mode? – Jacek Laskowski Jan 28 '19 at 17:56
  • Hello Jacek, thanks for your reply. I have read the guide and tried different things so far. I cannot skip the watermark mechanism so that the aggregations are written to disk from the very first batch. Is that possible? – messenjah00 Jan 29 '19 at 18:16
  • I think it is not possible. – Jacek Laskowski Jan 29 '19 at 21:13
  • Hello again, I found a piece of code that does what I need, but I'm not sure how it will perform and whether I can add whatever aggregation logic I need. The code uses a `KeyValueGroupedDataset` and `flatMapGroups` (expanded into a runnable sketch below these comments), like this: `val groupedDS = ds.groupByKey(_.id).flatMapGroups { case (id, iter) => Iterator((id, iter.length)) }.toDF("id", "count").writeStream.format("console").outputMode("append").start().awaitTermination()` Then just change the sink from console to file. What's your view on this? – messenjah00 Jan 30 '19 at 09:30
  • Looks very promising. I must admit I've never used `flatMapGroups` before, but I think I should change that. – Jacek Laskowski Jan 30 '19 at 10:24
  • Ok, I'll update here if I succeed – messenjah00 Jan 30 '19 at 13:06
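
An expanded, self-contained version of the `flatMapGroups` snippet from the comments might look roughly like this (the case class, schema and paths are placeholders, and the console sink from the comment is swapped for the file sink):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder record type for the incoming files
case class Event(id: String, value: Long)

object PerKeyCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("per-key-count").getOrCreate()
    import spark.implicits._

    val ds = spark.readStream
      .schema("id STRING, value LONG")        // placeholder schema
      .parquet("/data/input")                 // placeholder input path
      .as[Event]

    // Group by key and count the rows seen for each key,
    // without a watermark or a built-in streaming aggregation
    val counted = ds
      .groupByKey(_.id)
      .flatMapGroups { case (id, iter) => Iterator((id, iter.length)) }
      .toDF("id", "count")

    counted.writeStream
      .format("parquet")                                 // file sink instead of console
      .option("path", "/data/output")                    // placeholder output path
      .option("checkpointLocation", "/data/checkpoint")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```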

0 Answers