
I have the following schema:

 |-- record_id: integer (nullable = true)
 |-- Data1: string (nullable = true)
 |-- Data2: string (nullable = true)
 |-- Data3: string (nullable = true)
 |-- Time: timestamp (nullable = true)

I want to get, for each record_id, the record with the latest timestamp. I have not been able to do this in Structured Streaming. In Spark Streaming, I have achieved this on each incoming batch by using foreachRDD, converting each incoming RDD to a DataFrame, and then running my SQL query on it.

However, this yields results only for each new RDD, not for the whole history. I know I can do this in Spark Streaming using key-value pairs, but I am more interested in running SQL queries (group by, joins, and so on) over the whole history. How can I do this in Spark Streaming rather than in Structured Streaming? Another reason I can't do this in Structured Streaming is that I can't use a streaming aggregation before a join, which is essentially what I need here.
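
To make the difference concrete, here is a minimal sketch of the "latest row per record_id" logic. It is not my actual Spark code: it uses Python's built-in sqlite3 only as a stand-in for a SQL engine, and the table name `records`, the helper `latest_per_id`, and the sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical query: latest row per record_id from a table named `records`.
LATEST_PER_ID = """
SELECT r.record_id, r.Data1, r.Data2, r.Data3, r.Time
FROM records r
JOIN (SELECT record_id, MAX(Time) AS max_time
      FROM records
      GROUP BY record_id) m
  ON r.record_id = m.record_id AND r.Time = m.max_time
ORDER BY r.record_id
"""


def latest_per_id(rows):
    """Load rows into a fresh in-memory table and run the query on them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records "
        "(record_id INTEGER, Data1 TEXT, Data2 TEXT, Data3 TEXT, Time TEXT)"
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(LATEST_PER_ID).fetchall()
    conn.close()
    return result


# Two made-up micro-batches; timestamps are ISO strings so MAX() orders them.
batch1 = [
    (1, "a", "b", "c", "2018-01-01 10:00:00"),
    (2, "d", "e", "f", "2018-01-01 10:05:00"),
]
batch2 = [
    (1, "g", "h", "i", "2018-01-01 10:10:00"),
]

# What I get today: the query only sees the rows of the newest batch.
per_batch = latest_per_id(batch2)

# What I want: the query sees every row received so far.
whole_history = latest_per_id(batch1 + batch2)

print(per_batch)
print(whole_history)
```

In this sketch the per-batch result contains only record_id 1, while the whole-history result also keeps the latest row for record_id 2, which is the behaviour I am after.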

Mark B.

0 Answers