
Is it possible to tell Spark's dropDuplicates to drop the second occurrence instead of the first one?

scala> df.show()
+-----------+
|         _1|
+-----------+
|1 2 3 4 5 6|
|9 4 5 8 7 7|
|1 2 3 4 5 6|
+-----------+


scala> val newDf = df.dropDuplicates()
newDf: org.apache.spark.sql.DataFrame = [_1: string]

scala> newDf.show()
+-----------+                                                                   
|         _1|
+-----------+
|9 4 5 8 7 7|
|1 2 3 4 5 6|
+-----------+

1 Answer


Rank/index the rows that share the same values, then drop every record whose index/rank is > 1, as sketched below.
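
A minimal sketch of that approach, assuming the goal is to keep the first occurrence in input order: tag each row with a surrogate ordering column, rank rows that share the same value with row_number over a window, and keep only rank 1. The helper column names _id and _rn are made up for illustration, and monotonically_increasing_id() only approximates input order (its values grow with partition and row position), so substitute a real ordering column if one exists.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}

// Surrogate ordering column (hypothetical name _id); only approximates
// the original row order across partitions.
val withId = df.withColumn("_id", monotonically_increasing_id())

// Rank duplicates so the earliest occurrence of each value gets row_number 1.
val w = Window.partitionBy("_1").orderBy(col("_id"))

val deduped = withId
  .withColumn("_rn", row_number().over(w))
  .filter(col("_rn") === 1)   // drop all records with rank > 1
  .drop("_id", "_rn")

deduped.show()

Unlike dropDuplicates, which keeps an arbitrary occurrence, this makes the choice of survivor explicit through the window's orderBy.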