3

I want to create a dataset containing only the first and last value of a window. For my example:
1st entry is the earliest timestamp
2nd entry is the latest timestamp
window is the filename

The Window would look like this:

val FileWindow = Window.partitionBy($"location", $"systemname", $"filename").orderBy($"timestamp)

I have two different approaches for this but they both seem pretty slow.

1)

val ds2_first = ds_filtered
      .withColumn("rn", row_number().over(ascFileWindow))
      .withColumn("max_rn", max($"rn").over(ascFileWindow.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
      .withColumn("lf_flag", when($"rn" === 1 || $"rn" === $"max_rn", 1).otherwise(lit(0)))
      .where($"lf_flag" === 1)

2)

val ascFileWindow = Window.partitionBy($"systemname", $"filename").orderBy($"timestamp".asc)
val descFileWindow = Window.partitionBy($"systemname", $"filename").orderBy($"timestamp".desc)

val ds2_first = ds_filtered
      .withColumn("rn", row_number().over(ascFileWindow))
      .where($"rn" === 1)

val ds2_last = ds_filtered
      .withColumn("rn",row_number().over(descFileWindow))
      .where($"rn" === 1)

    val ds2 = ds2_first.union(ds2_last)

Does someone have a faster / different approach than me? Any help would be appreciated.

user2811630
  • 445
  • 3
  • 11

0 Answers0