
Is it possible to tell Spark's dropDuplicates to drop the second occurrence instead of the first one?

scala> df.show()
+-----------+
|         _1|
+-----------+
|1 2 3 4 5 6|
|9 4 5 8 7 7|
|1 2 3 4 5 6|
+-----------+


scala> val newDf = df.dropDuplicates()
newDf: org.apache.spark.sql.DataFrame = [_1: string]

scala> newDf.show()
+-----------+                                                                   
|         _1|
+-----------+
|9 4 5 8 7 7|
|1 2 3 4 5 6|
+-----------+

1 Answer


Rank/index the rows that share the same values, then drop every record whose index/rank is > 1, as sketched below.
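
A minimal sketch of that approach, assuming the goal is to keep the first occurrence in input order: tag each row with a surrogate ordering column, rank rows that share the same value with row_number over a window, and keep only rank 1. The helper column names _id and _rn are made up for illustration, and monotonically_increasing_id() only approximates input order (its values grow with partition and row position), so substitute a real ordering column if one exists.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}

// Surrogate ordering column (hypothetical name _id); only approximates
// the original row order across partitions.
val withId = df.withColumn("_id", monotonically_increasing_id())

// Rank duplicates so the earliest occurrence of each value gets row_number 1.
val w = Window.partitionBy("_1").orderBy(col("_id"))

val deduped = withId
  .withColumn("_rn", row_number().over(w))
  .filter(col("_rn") === 1)   // drop all records with rank > 1
  .drop("_id", "_rn")

deduped.show()

Unlike dropDuplicates, which keeps an arbitrary occurrence, this makes the choice of survivor explicit through the window's orderBy.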