I want to create a dataset containing only the first and last value of a window.
For my example:
1st entry is the earliest timestamp
2nd entry is the latest timestamp
window is the filename
The Window would look like this:
val FileWindow = Window.partitionBy($"location", $"systemname", $"filename").orderBy($"timestamp)
I have two different approaches for this but they both seem pretty slow.
1)
val ds2_first = ds_filtered
.withColumn("rn", row_number().over(ascFileWindow))
.withColumn("max_rn", max($"rn").over(ascFileWindow.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
.withColumn("lf_flag", when($"rn" === 1 || $"rn" === $"max_rn", 1).otherwise(lit(0)))
.where($"lf_flag" === 1)
2)
val ascFileWindow = Window.partitionBy($"systemname", $"filename").orderBy($"timestamp".asc)
val descFileWindow = Window.partitionBy($"systemname", $"filename").orderBy($"timestamp".desc)
val ds2_first = ds_filtered
.withColumn("rn", row_number().over(ascFileWindow))
.where($"rn" === 1)
val ds2_last = ds_filtered
.withColumn("rn",row_number().over(descFileWindow))
.where($"rn" === 1)
val ds2 = ds2_first.union(ds2_last)
Does someone have a faster / different approach than me? Any help would be appreciated.