
I need to understand how I can remove duplicate rows from a DataFrame based on a single column in Spark SQL using Java.

Like in normal SQL: ROW_NUMBER() OVER (PARTITION BY col ORDER BY col DESC). How can I translate this step into Spark SQL in Java?

us56

2 Answers


You can remove duplicates from a DataFrame using dataframe.dropDuplicates("col1"). It removes all rows that have duplicate values in col1, keeping only one of them. This API is available from Spark 2.x onwards.
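A minimal sketch of this in Java. The column name `col1` and the sample data are made up for illustration; note that without an explicit ordering, which of the duplicate rows survives is not guaranteed.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DropDuplicatesExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DropDuplicatesExample")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical sample data: two rows share the value 'a' in col1
        Dataset<Row> df = spark.sql(
                "SELECT * FROM VALUES ('a', 1), ('a', 2), ('b', 3) AS t(col1, col2)");

        // Keep one row per distinct value of col1; which duplicate is kept
        // is not deterministic without an ordering
        Dataset<Row> deduped = df.dropDuplicates("col1");

        deduped.show();
        spark.stop();
    }
}
```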

Sagar balai

You are on the right track. Use a window function and then filter the DataFrame on row_number = 1 to get the latest record (the ORDER BY field determines the row_number).
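A sketch of this approach in Java, mirroring ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC) from the question. The column names `key` and `version` and the sample data are assumptions for illustration.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class LatestRecordPerKey {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("LatestRecordPerKey")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical sample data: key 'a' has two versions
        Dataset<Row> df = spark.sql(
                "SELECT * FROM VALUES ('a', 1), ('a', 2), ('b', 3) AS t(key, version)");

        // Equivalent of ROW_NUMBER() OVER (PARTITION BY key ORDER BY version DESC)
        WindowSpec w = Window.partitionBy(col("key")).orderBy(col("version").desc());

        // Keep only the latest record per key, then drop the helper column
        Dataset<Row> latest = df
                .withColumn("rn", row_number().over(w))
                .filter(col("rn").equalTo(1))
                .drop("rn");

        latest.show();
        spark.stop();
    }
}
```

Unlike dropDuplicates, this version lets you control exactly which duplicate survives via the ORDER BY clause.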

See the links below.

http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html

How to use Analytic/Window Functions in Spark Java?

loneStar