I've researched this a little and found an answer for general Spark applications. However, in Structured Streaming you cannot join two streaming DataFrames (so a self-join is not possible), and sorting functions cannot be used either. So is there any way to get the latest entry for each group? (I'm on Spark 2.2.)
UPDATE: Assuming the DataFrame rows are already sorted by time, we can take the last entry for each group using groupBy
followed by agg
with the pyspark.sql.functions.last
function.