Assume I have a DataFrame like:
val json = sc.parallelize(Seq(
  """{"a":1, "b":2, "c":22, "d":34}""",
  """{"a":3, "b":9, "c":22, "d":12}""",
  """{"a":1, "b":4, "c":23, "d":12}"""
))
val df = sqlContext.read.json(json)
I want to remove duplicate rows for column "a" based on the value of column "b"; i.e., if there are duplicate rows for column "a", I want to keep the one with the larger value in "b". For the above example, after processing, I need only:
{"a":3, "b":9, "c":22, "d":12}
and
{"a":1, "b":4, "c":23, "d":12}
The Spark DataFrame dropDuplicates API doesn't seem to support this. With the RDD approach I can do a map().reduceByKey() (sketched below), but is there a DataFrame-specific operation for this?
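For reference, the RDD fallback I have in mind is roughly this (a sketch only; getAs[Long] assumes Spark inferred the JSON integers as LongType):

val deduped = df.rdd
  .map(row => (row.getAs[Long]("a"), row))  // key each Row by column "a"
  .reduceByKey((r1, r2) =>                  // per key, keep the Row with the larger "b"
    if (r1.getAs[Long]("b") >= r2.getAs[Long]("b")) r1 else r2)
  .values                                   // drop the keys, leaving one Row per "a"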
I'd appreciate some help, thanks.