How to execute non-timestamp-based aggregations in Spark Structured Streaming?

Question

Consider the following intended sql:

select row_number() 
  over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank,* 
  from flights
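
To make the intent concrete, here is a rough plain-Python sketch of what that query would return on a static table. The sample rows and the `Carrier` column are made up for illustration, and `rank_within_origin` is a hypothetical helper, not a Spark API:

```python
from itertools import groupby
from operator import itemgetter


def rank_within_origin(rows):
    """Emulate row_number() over (partition by Origin order by OnTimeDepPct desc)."""
    # Sort by partition key first, then by OnTimeDepPct descending within it.
    ordered = sorted(rows, key=lambda r: (r["Origin"], -r["OnTimeDepPct"]))
    ranked = []
    for _, group in groupby(ordered, key=itemgetter("Origin")):
        # Numbering restarts at 1 for every Origin.
        for rank, row in enumerate(group, start=1):
            ranked.append({"OnTimeDepRank": rank, **row})
    return ranked


# Made-up sample data; only Origin and OnTimeDepPct come from the query.
flights = [
    {"Origin": "SFO", "Carrier": "AA", "OnTimeDepPct": 81.5},
    {"Origin": "SFO", "Carrier": "UA", "OnTimeDepPct": 88.0},
    {"Origin": "JFK", "Carrier": "DL", "OnTimeDepPct": 79.2},
    {"Origin": "JFK", "Carrier": "B6", "OnTimeDepPct": 84.1},
]

for row in rank_within_origin(flights):
    print(row)
```

The point is that the ranking restarts for every `Origin` and does not depend on any time column, which is exactly what the streaming engine rejects.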

This will not work in Structured Streaming, as shown in the question "Spark - Non-time-based windows are not supported on streaming DataFrames/Datasets" and in my own answer to it: https://stackoverflow.com/a/55777253/1056563

The culprit is:

 partition by Origin

The requirement is to use a timestamp-typed field such as

 partition by flightTime

This requirement comes from a definitive source (a core committer for Spark Streaming), who stated that timestamp-based aggregations are supported. The syntax in that case uses window:

window("timestamp", "10 minutes")
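
As I understand it, a single-duration window like this is a tumbling window: each row falls into exactly one fixed-width bucket, and the buckets are aligned to the Unix epoch by default. A plain-Python sketch of that bucketing, under those assumptions (`tumbling_window` is a hypothetical helper, not part of Spark):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(ts, minutes=10):
    """Return the [start, end) bucket that a timestamp falls into."""
    width = timedelta(minutes=minutes)
    # Count whole windows elapsed since the epoch, then scale back up.
    start = EPOCH + ((ts - EPOCH) // width) * width
    return start, start + width


event_time = datetime(2019, 4, 22, 1, 43, tzinfo=timezone.utc)
start, end = tumbling_window(event_time)
print(start.isoformat(), end.isoformat())
```

The grouping key is derived purely from the event time, which is why a non-time column such as `Origin` cannot take its place in this construct.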

There is actually an additional complication: Structured Streaming does not support correlated subqueries. Therefore the generally useful answers from the esteemed Gordon Linoff here: https://stackoverflow.com/a/46856508/1056563 cannot be applied.

What then for my query above, which must be based on the Origin field? What is the closest equivalent to that query? Or what would be a workaround or a different approach to achieve the same results?

I thought that windowed aggregations (`over` operator) are not supported at all (and that's what TD said in the answer you linked). So it's not about changing `partition by Origin` to `partition by flightTime`; whatever you want to achieve with the window aggregation, you simply have to re-design the streaming query with some other algorithm. What are you really trying to do? - Jacek Laskowski, Apr 22 '19 at 01:43
The `partition` was intended for showing the intent - the actual syntax is `window`. I updated the question. - WestCoastProjects, Apr 22 '19 at 04:40
@JacekLaskowski I would *really* appreciate some insight on what ***other*** approaches might be possible here. I am *really* trying to do exactly what is in the query here, and it is *really* difficult to understand how to translate this to Structured Streaming. - WestCoastProjects, May 11 '19 at 22:34
