run SQL queries on record history (aggregation) spark streaming

Asked Jun 20 '18 at 07:58

Active Jun 20 '18 at 08:03

Viewed 165 times

I have a schema -

 |-- record_id: integer (nullable = true)
 |-- Data1: string (nullable = true)
 |-- Data2: string (nullable = true)
 |-- Data3: string (nullable = true)
 |-- Time: timestamp (nullable = true)

I want to know the record for each record id with latest timestamp. I have not been able to do this in structured streaming. In Spark Streaming, I have achieved this on each incoming batch by using foreachRDD, and converting each incoming RDD to a dataframe and then running my sql query on it.

However, this yields results only on each new RDD, and not using the whole history. I know I can do this in Spark Streaming using Key Value Pairs, but I'm rather interested in running SQL queries on whole of the History (group by, joins and such). How can I do it in Spark Streaming, and not in Spark Structured Streaming ? Another reason I cant do this in structured streaming is because I cant use Streaming Aggregation before Joins, which is kind of what I require for this.

edited Jun 20 '18 at 08:03

asked Jun 20 '18 at 07:58

Mark B.

can you use Window functions in streaming ? – Steven Jun 20 '18 at 09:06
Non-time-based windows are not supported on streaming DataFrames/Datasets. – Mark B. Jun 20 '18 at 13:50

run SQL queries on record history (aggregation) spark streaming

0 Answers0