I am using spark-sql-2.4.1v in my project.
To get a unique companies for given time range I am using spark window-function with rank. But for the same date for the same company id , it is ranking two records as top 1 record.
I am using below code snippet to find unique i.e. rank-1 records.
val ws = Window.partitionBy("company_id").orderBy(desc("created_date"))
val rankDs = sourceDs.withColumn("rank",rank().over(ws))
.where($"rank".===(lit(1)))
.select("*")
The results it shows is
+-------------+-----------+----+
| created_date| company_id|rank|
+-------------+-----------+----+
| 2018-02-01| 17869354| 1|
| 2018-02-01| 17869354| 1|
| 2018-02-01| 41248792| 1|
| 2018-02-01| 41248792| 1|
| 2018-02-01| 86947653| 1|
| 2018-02-01| 86947653| 1|
--------------------------------
As show above it is giving duplicate records for few source records, so what is wrong here , how to fix this i.e. only give rank-1 records without duplicates ?
Expected Result :
+-------------+-----------+----+
| created_date| company_id|rank|
+-------------+-----------+----+
| 2018-02-01| 17869354| 1|
| 2018-02-01| 41248792| 1|
| 2018-02-01| 86947653| 1|
--------------------------------