
I would like to find the last record for each ID in a typed Dataset. I found a DataFrame-based solution in "Find minimum for a timestamp through Spark groupBy dataframe".

But how can I do the same with a typed Dataset?

Something like:

    case class Person(id: Int, name: String, time: Timestamp, kind: String) 

    val ds:DataSet[Person] = Seq(

        (1, "Bob", parseDate("03/08/02 00:00:00"), "P"),
        (1, "Bob", parseDate("04/08/02 00:00:00"), "PI"),
        (1, "Bob", parseDate("03/08/02 12:00:00"), "PE"))
        .toDF("id", "name", "time", "kind").as[Person]

    ds.groupByKey(_.id)
        .agg(max(_.time), _)
    //            .agg(max(struct("time", columnsButTime: _*)) as "all") => works with a DataFrame
    //            .select("all.*")
  • Can you define what _"the last record"_ is? nit: There's no `DataSet` type. – Jacek Laskowski Oct 30 '17 at 21:58
  • It is the record with the latest time value for a given id. In my example, all the records have the same id, so the second one is the latest. With this input:
        (1, "Bob", parseDate("03/08/02 00:00:00"), "P"),
        (1, "Bob", parseDate("04/08/02 00:00:00"), "PI"),
        (1, "Bob", parseDate("03/08/02 12:00:00"), "PE"),
        (2, "Alic", parseDate("03/01/02 12:00:00"), "PE"),
        (2, "Alice", parseDate("03/01/02 12:05:00"), "PE"),
    we expect the result:
        (1, "Bob", parseDate("04/08/02 00:00:00"), "PI"),
        (2, "Alice", parseDate("03/01/02 12:05:00"), "PE")
    – user1759985 Oct 31 '17 at 23:23
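
One possible typed approach, offered as a minimal sketch rather than a confirmed answer: group with `groupByKey` and keep the later record of each pair with `reduceGroups`, then drop the key. It assumes Spark 2.x or later running in local mode. The `LatestPerId` object and `latestPerId` function are hypothetical names, and the `Timestamp.valueOf` values are illustrative stand-ins for the question's `parseDate` helper, which is not shown.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}

// Defined at top level so Spark can derive an encoder for it.
case class Person(id: Int, name: String, time: Timestamp, kind: String)

object LatestPerId {
  // Hypothetical helper: keeps, for each id, the record with the latest time.
  def latestPerId(ds: Dataset[Person]): Dataset[Person] = {
    import ds.sparkSession.implicits._
    ds.groupByKey(_.id)
      // Pairwise reduction: keep whichever record has the later timestamp
      // (on an exact tie, the second argument is kept).
      .reduceGroups((a, b) => if (a.time.after(b.time)) a else b)
      // reduceGroups yields (key, value) pairs; keep only the value.
      .map(_._2)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-per-id")
      .getOrCreate()
    import spark.implicits._

    // Illustrative timestamps (assumption), standing in for parseDate(...).
    val ds: Dataset[Person] = Seq(
      Person(1, "Bob", Timestamp.valueOf("2002-08-03 00:00:00"), "P"),
      Person(1, "Bob", Timestamp.valueOf("2002-08-04 00:00:00"), "PI"),
      Person(1, "Bob", Timestamp.valueOf("2002-08-03 12:00:00"), "PE"),
      Person(2, "Alic", Timestamp.valueOf("2002-01-03 12:00:00"), "PE"),
      Person(2, "Alice", Timestamp.valueOf("2002-01-03 12:05:00"), "PE")
    ).toDS()

    latestPerId(ds).show(truncate = false)
    spark.stop()
  }
}
```

The result stays a `Dataset[Person]`, so no column names or `struct` tricks are needed. Because `reduceGroups` combines records pairwise, the function passed to it should be associative and commutative, which picking the later timestamp is as long as timestamps within an id are distinct.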
