The PySpark SQL functions reference for the row_number()
function says:
returns a sequential number starting at 1 within a window partition
implying that the function works only over window partitions. Trying
df.select('*', row_number())
predictably gives a
Window function row_number() requires an OVER clause
exception.
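For contrast, the batch way to satisfy that OVER clause is a WindowSpec built with pyspark.sql.window.Window; the column names below are just illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 1)], ["key", "time"])

# Each row gets a sequential number within its 'key' partition, ordered by 'time'
w = Window.partitionBy("key").orderBy("time")
df.select("*", row_number().over(w)).show()

That works fine on a batch DataFrame.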
Now, .over()
seems to accept only a WindowSpec, because
from pyspark.sql.functions import window, row_number
...
df.select('*', row_number().over(window('time', '5 minutes')))
gives a
TypeError: window should be WindowSpec
exception.
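As far as I can tell, that's because window() from pyspark.sql.functions is not a WindowSpec at all: it is a grouping expression that buckets rows into time windows and is meant to be used inside groupBy. A minimal sketch, assuming df has a timestamp column named 'time':

from pyspark.sql.functions import window

# window() produces a time-bucket struct column for aggregation;
# this is the time-window support that Structured Streaming handles natively
df.groupBy(window('time', '5 minutes')).count()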
However, according to this comment on the ASF Jira:
By time-window we described what time windows are supported in SS natively. http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#types-of-time-windows
Window spec is not supported. This defines the boundary of window as non-timed manner, the offset(s) of the row, which is hard to track in streaming context.
WindowSpec is generally not supported in Structured Streaming, which leads to the conclusion that the row_number()
function is not supported in Structured Streaming either. Is that correct? I just want to make sure I'm not missing anything here.
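For completeness, here is what I believe a minimal repro looks like if you try anyway, sketched with the built-in rate source (untested here; the exact message may differ across Spark versions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").load()  # emits (timestamp, value) rows

w = Window.partitionBy("value").orderBy("timestamp")
numbered = stream.select("*", row_number().over(w))

# Starting the query should fail with an AnalysisException along the lines of
# "Non-time-based windows are not supported on streaming DataFrames/Datasets"
numbered.writeStream.format("console").start()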