I am using Spark Structured Streaming to read files as they arrive in a specific folder on my system.
I want to run a streaming aggregation query on the data and write the result to Parquet files on every batch, using Append mode. That way, Spark Structured Streaming performs a partial intra-batch aggregation that is written to disk, and we read the output Parquet files through an Impala table that points to the output directory. So I need the output to look something like this:
batch     aggregated_value
batch-1   10
batch-2   8
batch-3   17
batch-4   13
I don't actually need the batch column, but it helps clarify what I am trying to do.
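For reference, here is a minimal sketch of the kind of job I have in mind; the schema, the paths, and the sum aggregation are placeholders for my actual pipeline:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum
    import org.apache.spark.sql.types.{DoubleType, StructType}

    object PerBatchAggregation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("PerBatchAggregation")
          .getOrCreate()
        import spark.implicits._

        // Placeholder schema for the incoming files.
        val schema = new StructType().add("value", DoubleType)

        // Stream new files as they land in the input folder.
        val input = spark.readStream
          .schema(schema)
          .csv("/data/input")

        // The aggregation I want computed once per micro-batch.
        val aggregated = input.agg(sum($"value").as("aggregated_value"))

        // Intended sink: one Parquet result per batch, in Append mode,
        // so the Impala table pointing at /data/output picks up each
        // partial aggregate as it is written.
        val query = aggregated.writeStream
          .outputMode("append")
          .format("parquet")
          .option("path", "/data/output")
          .option("checkpointLocation", "/data/checkpoint")
          .start()

        query.awaitTermination()
      }
    }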
Does Structured Streaming offer a way to achieve this?