
I need to understand how I can remove duplicate rows from a DataFrame based on a single column in Spark SQL using Java.

Like in normal SQL: ROW_NUMBER() OVER (PARTITION BY col ORDER BY col DESC). How can I translate this step into Spark SQL in Java?

us56

2 Answers


You can remove duplicates from a DataFrame using dataframe.dropDuplicates("col1"). It removes all rows that have duplicate values in col1, keeping only one of them. This API is available from Spark 2.x onwards.
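A minimal sketch of this in Java. The column name `col1` and the sample data are made up for illustration; note that without an explicit ordering, which of the duplicate rows survives is not guaranteed.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DropDuplicatesExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DropDuplicatesExample")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical sample data: two rows share the value 'a' in col1
        Dataset<Row> df = spark.sql(
                "SELECT * FROM VALUES ('a', 1), ('a', 2), ('b', 3) AS t(col1, col2)");

        // Keep one row per distinct value of col1; which duplicate is kept
        // is not deterministic without an ordering
        Dataset<Row> deduped = df.dropDuplicates("col1");

        deduped.show();
        spark.stop();
    }
}
```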

Sagar balai

You are on the right track. Use a window function and then filter the DataFrame on row_number = 1 to get the latest record (the ORDER BY field determines the row_number).
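A sketch of this approach in Java, mirroring ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC) from the question. The column names `key` and `version` and the sample data are assumptions for illustration.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class LatestRecordPerKey {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("LatestRecordPerKey")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical sample data: key 'a' has two versions
        Dataset<Row> df = spark.sql(
                "SELECT * FROM VALUES ('a', 1), ('a', 2), ('b', 3) AS t(key, version)");

        // Equivalent of ROW_NUMBER() OVER (PARTITION BY key ORDER BY version DESC)
        WindowSpec w = Window.partitionBy(col("key")).orderBy(col("version").desc());

        // Keep only the latest record per key, then drop the helper column
        Dataset<Row> latest = df
                .withColumn("rn", row_number().over(w))
                .filter(col("rn").equalTo(1))
                .drop("rn");

        latest.show();
        spark.stop();
    }
}
```

Unlike dropDuplicates, this version lets you control exactly which duplicate survives via the ORDER BY clause.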

See the links below.

http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html

How to use Analytic/Window Functions in Spark Java?

loneStar