
I am using Spark SQL 2.4.1 in my project.

To get the unique companies for a given time range, I am using a Spark window function with rank(). But for the same date and the same company id, it is ranking two records as the top-1 record.

I am using the below code snippet to find the unique, i.e. rank-1, records.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, rank}
import spark.implicits._  // for the $"..." column syntax

val ws = Window.partitionBy("company_id").orderBy(desc("created_date"))

val rankDs = sourceDs.withColumn("rank", rank().over(ws))
                     .where($"rank" === 1)

The result it shows is:

+-------------+-----------+----+
| created_date| company_id|rank|
+-------------+-----------+----+
|   2018-02-01|   17869354|   1|
|   2018-02-01|   17869354|   1|
|   2018-02-01|   41248792|   1|
|   2018-02-01|   41248792|   1|
|   2018-02-01|   86947653|   1|
|   2018-02-01|   86947653|   1|
+-------------+-----------+----+

As shown above, it is giving duplicate records for a few source records. What is wrong here, and how can I fix it so that it gives only the rank-1 records without duplicates?

Expected result:

+-------------+-----------+----+
| created_date| company_id|rank|
+-------------+-----------+----+
|   2018-02-01|   17869354|   1|
|   2018-02-01|   41248792|   1|
|   2018-02-01|   86947653|   1|
+-------------+-----------+----+
BdEngineer

1 Answer


Instead of rank(), use the row_number() window function to get unique records without duplicates. rank() assigns the same rank to rows that tie on the ordering column, so two rows with the same created_date in a partition both come out as rank 1, whereas row_number() assigns a unique, sequential number within each partition.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}
import spark.implicits._  // for the $"..." column syntax

val ws = Window.partitionBy("company_id").orderBy(desc("created_date"))

val rankDs = sourceDs.withColumn("row_number", row_number().over(ws))
                     .where($"row_number" === 1)
//result
//+------------+----------+---+----------+
//|created_date|company_id| no|row_number|
//+------------+----------+---+----------+
//|  2018-02-01|  86947653|  1|         1|
//|  2018-02-01|  41248792|  1|         1|
//|  2018-02-01|  17869354|  1|         1|
//+------------+----------+---+----------+
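One caveat: when several rows share the same created_date within a company, row_number() breaks the tie arbitrarily, so which record survives can differ between runs. A minimal sketch of making the pick deterministic by adding a secondary sort key; the no column is an assumption here (it appears in the result above), and any column that is stable per row would do:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Secondary sort key makes the choice among rows with equal
// created_date reproducible. "no" is assumed to exist in sourceDs;
// substitute any stable per-row column.
val wsStable = Window
  .partitionBy("company_id")
  .orderBy(desc("created_date"), desc("no"))

val latestDs = sourceDs
  .withColumn("row_number", row_number().over(wsStable))
  .where(col("row_number") === 1)
  .drop("row_number")  // keep only the original columns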
notNull
  • any advice how to solve this in spark ... https://stackoverflow.com/questions/62933135/dataframe-look-up-and-optimization – BdEngineer Jul 16 '20 at 11:03