I have a Hive table with the schema:
id bigint
name string
updated_dt bigint
There are many records with the same id but different name and updated_dt values. For each id, I want to return the whole row with the largest updated_dt.
My current approach is:
After reading the data from Hive, I can use a case class to convert it to an RDD, then use groupBy() to group all records with the same id together, and finally pick the one with the largest updated_dt. Something like:
dataRdd.groupBy(_.id).map(x => x._2.toSeq.maxBy(_.updated_dt))
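For reference, the case class and the conversion look roughly like this (the table name my_table is a placeholder, and a SparkSession named spark is assumed):

case class Record(id: Long, name: String, updated_dt: Long)

import spark.implicits._  // needed for .as[Record]

// Read the Hive table as a Dataset[Record], then drop down to an RDD
val ds = spark.table("my_table").as[Record]
val dataRdd = ds.rdd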
However, since I use Spark 2.1, the data is first converted to a Dataset using the case class, and the approach above then converts it to an RDD just to use groupBy(). There may be some overhead in converting the Dataset to an RDD, so I was wondering if I can achieve this at the Dataset level without converting to an RDD?
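For example, I imagine something along these lines might work, using groupByKey and reduceGroups on the Dataset directly (just a sketch; ds is the Dataset[Record] from above, and I have not measured whether it is actually faster):

// Group by id without leaving the Dataset API, then reduce each
// group down to the single row with the largest updated_dt
val latest = ds.groupByKey(_.id)
  .reduceGroups((a, b) => if (a.updated_dt >= b.updated_dt) a else b)
  .map(_._2)  // reduceGroups yields (key, row) pairs; keep only the row

I have also seen window functions (row_number over a partition by id, ordered by updated_dt descending) suggested for this kind of top-1-per-group problem, but I am not sure which approach is preferable for Datasets.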
Thanks a lot