
How can I do stateless aggregations in Spark with Structured Streaming 2.3.0, without using flatMapGroupsWithState or the DStream API? I am looking for a more declarative way.

Example:

select count(*) from some_view

I want the output to count only the records available in each batch, not an aggregate carried over from previous batches.

zero323
user1870400

1 Answer


To do stateless aggregations in Spark using Structured Streaming 2.3.0, without using flatMapGroupsWithState or the DStream API, you can use the following code:

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

def countValues = (_: String, it: Iterator[(String, String)]) => it.length

val query =
  dataStream
    .select(lit("a").as("newKey"), col("value"))
    .as[(String, String)]
    .groupByKey { case(newKey, _) => newKey }
    .mapGroups[Int](countValues)
    .writeStream
    .format("console")
    .start()

Here what we are doing is-

  1. We add one column, newKey, to our dataStream so that we can group over it using groupByKey. I have used the literal string "a", but any constant value will do. You also need to select one of the existing columns of dataStream; I have selected the value column here, but any column works.
  2. We define a mapping function, countValues, which counts the rows in each group produced by groupByKey by calling it.length on the group's iterator.
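The grouping logic above can be sketched on a plain Scala collection, independent of Spark. Here countValues has the same shape as in the streaming query, while the sample batch and the groupBy/map combination are just an illustration of what groupByKey and mapGroups do per micro-batch:

```scala
// Same shape as the mapping function used in the streaming query:
// the key is ignored, only the group's rows are counted.
val countValues = (_: String, it: Iterator[(String, String)]) => it.length

// A hypothetical micro-batch: every row carries the same constant key "a".
val batch = Seq(("a", "v1"), ("a", "v2"), ("a", "v3"))

// groupByKey + mapGroups is mimicked with groupBy + map on a collection.
val counts = batch
  .groupBy { case (newKey, _) => newKey }
  .map { case (key, rows) => countValues(key, rows.iterator) }
```

Because every row shares the constant key, there is exactly one group per batch, and its count equals the batch size.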

In this way, we count only the records available in each batch, without aggregating over previous batches.

I hope it helps!

himanshuIIITian