Give following code
case class Contact(name: String, phone: String)
case class Person(name: String, ts:Long, contacts: Seq[Contact])
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val people = sqlContext.read.format("orc").load("people")
What is the best way to dedupe users by its timestamp So the user with max ts will stay at collection? In spark using RDD I would run something like this
rdd.reduceByKey(_ maxTS _)
and would add the maxTS method to Person or add implicits ...
def maxTS(that: Person):Person =
that.ts > ts match {
case true => that
case false => this
}
Is it possible to do the same at DataFrames? and will that be the similar performance? We are using spark 1.6