
Assume I have a DataFrame like:

val json = sc.parallelize(Seq("""{"a":1, "b":2, "c":22, "d":34}""","""{"a":3, "b":9, "c":22, "d":12}""","""{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)

I want to deduplicate rows on column "a" based on the value of column "b"; i.e., if there are duplicate rows for column "a", I want to keep the one with the larger value in "b". For the example above, after processing I need only

{"a":3, "b":9, "c":22, "d":12}

and

{"a":1, "b":4, "c":23, "d":12}

Spark's DataFrame dropDuplicates API doesn't seem to support this. With the RDD approach I could do a map().reduceByKey(), but is there a DataFrame-specific operation for this?

Appreciate some help, thanks.

void
  • How did you decide which c and d values to keep? If you wanted the max of those then something like df.groupBy("a").max("a", "b", "c") would do it. – Robert Horvick Feb 19 '16 at 05:56
  • I am not looking for the max of "c" and "d". Just need the row for "a" with max("b"). – void Feb 19 '16 at 06:10

1 Answer


You can use a window function in Spark SQL to achieve this.

df.registerTempTable("x")
sqlContext.sql("SELECT a, b, c, d FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) rn FROM x) y WHERE rn = 1").collect

This will achieve what you need. Read more about window function support at https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
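For reference, the same logic can also be written with the DataFrame window-function API instead of raw SQL. This is a sketch against the question's df; on Spark 1.x it still requires a HiveContext, and the row_number function assumes Spark 1.6+:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Number the rows within each "a" partition, largest "b" first,
// then keep only the top row of every partition.
val w = Window.partitionBy("a").orderBy(col("b").desc)

val deduped = df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

deduped.show()
```

This keeps the whole row (including "c" and "d") that carries the maximum "b" per "a", which plain groupBy().max() alone would not.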

Pankaj Arora
  • Thanks! but that's giving me an exception. `failure: ``)'' expected but '(' found` near `OVER(` – void Feb 19 '16 at 07:02
  • It seems you need to upgrade to 1.4 (or higher) – Marmite Bomber Feb 19 '16 at 07:16
  • I am running on 1.6.0. And, it's giving me the exception `org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;` when I follow the examples in https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html – void Feb 19 '16 at 07:20
  • Window functions are not supported for regular DF? – void Feb 19 '16 at 07:20
  • Seems you need to rebuild spark with hive support. http://stackoverflow.com/questions/33792306/why-does-using-rank-windowing-function-break-the-parser – Pankaj Arora Feb 19 '16 at 07:40
  • Window functions aren't available on sqlcontext, only hive – Robert Horvick Feb 19 '16 at 12:33
  • Building with Hive support: http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support Or you can download the notebook with Hive support that I sometimes use from here: http://spark-notebook.io/ – Pankaj Arora Feb 19 '16 at 13:01
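If rebuilding Spark with Hive support isn't an option, a groupBy-plus-join sketch avoids window functions entirely and runs on a plain SQLContext (df is the DataFrame from the question; note this can return more than one row per key when the maximum "b" is tied):

```scala
import org.apache.spark.sql.functions.max

// Find the largest "b" for every "a" ...
val maxB = df.groupBy("a").agg(max("b").as("b"))

// ... then join back on both columns to recover "c" and "d".
val deduped = df.join(maxB, Seq("a", "b"))

deduped.show()
```

The Seq("a", "b") join variant (available since Spark 1.6) also avoids duplicating the join columns in the output.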