
I've researched this a little and found an answer for general Spark applications. However, in Structured Streaming you cannot join two streaming dataframes (so a self-join is not possible), and sorting functions cannot be used either. So is there any way to get the latest entry for each group? (I'm on Spark 2.2.)

UPDATE: Assuming the dataframe rows are already sorted by time, we can take the last entry for each group using `groupBy` followed by `agg` with the `pyspark.sql.functions.last` function.

  • [Spark 2.3](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) Structured Streaming supports some cases of joins; you may refer to the doc. I think your requirement can be met using window functions, within the constraints mentioned there. – sujit Mar 26 '18 at 07:05
  • I'm actually running on Spark 2.2. Unfortunately. – absolutelydevastated Mar 26 '18 at 07:11

0 Answers