
I've researched this a little and found an answer for general Spark applications. However, in Structured Streaming you cannot join two streaming dataframes (so a self-join is not possible), and sorting functions cannot be used either. So is there any way to get the latest entry for each group? (I'm on Spark 2.2.)

UPDATE: Assuming the dataframe rows are already sorted by time, we can take the last entry for each group using `groupBy` followed by `agg` with the `pyspark.sql.functions.last` function.

  • [Spark 2.3](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) Structured Streaming supports some cases of joins; you may refer to the doc. I think your requirement can be met using window functions, within the constraints mentioned there. – sujit Mar 26 '18 at 07:05
  • I'm actually running on Spark 2.2. Unfortunately. – absolutelydevastated Mar 26 '18 at 07:11

0 Answers