
I have a Hive table with the schema:

id             bigint
name           string
updated_dt     bigint

There are many records with the same id but different name and updated_dt. For each id, I want to return the record (the whole row) with the largest updated_dt.

My current approach is:

After reading the data from Hive, I can use a case class to convert the data to an RDD, then use groupBy() to group all the records with the same id together, and finally pick the one with the largest updated_dt. Something like:

 dataRdd.groupBy(_.id).map(x => x._2.toSeq.maxBy(_.updated_dt))

However, since I use Spark 2.1, the data is first converted to a Dataset using the case class, and the approach above then converts it to an RDD in order to use groupBy(). There may be some overhead in converting the Dataset to an RDD. So I was wondering if I can achieve this at the Dataset level without converting to an RDD?
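For context, here is a minimal sketch of my current setup; the case class Record and the table name my_table are hypothetical stand-ins for my actual code:

    import org.apache.spark.sql.SparkSession

    // Case class mirroring the Hive schema (hypothetical name)
    case class Record(id: Long, name: String, updated_dt: Long)

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    import spark.implicits._

    // Read the Hive table as a typed Dataset, then drop to an RDD for groupBy()
    val dataRdd = spark.table("my_table").as[Record].rdd
    val latest = dataRdd.groupBy(_.id).map(x => x._2.toSeq.maxBy(_.updated_dt))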

Thanks a lot


1 Answer


Here is how you can do it using the Dataset API:

import org.apache.spark.sql.functions.max

data.groupBy($"id").agg(max($"updated_dt") as "Max")

There is not much overhead if you convert it to an RDD. If you do choose the RDD route, it can be made more efficient by using reduceByKey() instead of groupBy():

dataRdd.keyBy(_.id).reduceByKey((a,b) => if(a.updated_dt > b.updated_dt) a else b).values
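
Note that the groupBy/agg snippet above returns only id and the max updated_dt. To get the whole row while staying in the Dataset API, one option is the typed groupByKey/reduceGroups; a minimal sketch, assuming the same hypothetical Record case class as in the question:

    import spark.implicits._

    // Reduce each id group to the row with the largest updated_dt, then drop the key
    val latestDs = data.as[Record]
      .groupByKey(_.id)
      .reduceGroups((a, b) => if (a.updated_dt > b.updated_dt) a else b)
      .map(_._2)

An untyped alternative is a window function: row_number() over Window.partitionBy($"id").orderBy($"updated_dt".desc), keeping only the rows numbered 1.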
  • Thanks. "data.groupBy($"id").agg(max($"updated_dt") as "Max")" only returns max updated_dt, but I want to return the whole record/row with max updated_dt. – michelle Apr 07 '18 at 15:38
  • It's mentioned in the link posted by @user8371915 – vdep Apr 07 '18 at 16:39