0

I have the following dataframe df in Spark Scala:

id   project  start_date    Change_date designation

1    P1       08/10/2018      01/09/2017   2
1    P1       08/10/2018      02/11/2018   3
1    P1       08/10/2018      01/08/2016   1

then get designation closure to start_date and less than that

Expected output:

id   project  start_date    designation
1      P1     08/10/2018    2

This is because change date 01/09/2017 is the closest date before start_date.

Can somebody advice how to achieve this?

This is not selecting first row but selecting the designation corresponding to change date closest to the start date

user3222101
  • 1,270
  • 2
  • 24
  • 43
  • thanks for checking! this is not duplicate as i am not interested in knowing the first row only, i want the data as per diff logiv – user3222101 Jan 30 '18 at 00:40

1 Answers1

1

Parse dates:

import org.apache.spark.sql.functions._

val spark: SparkSession = ???
import spark.implicits._

val df = Seq(
  (1, "P1", "08/10/2018", "01/09/2017", 2), 
  (1, "P1", "08/10/2018", "02/11/2018", 3),
  (1, "P1", "08/10/2018", "01/08/2016", 1)
).toDF("id", "project_id", "start_date", "changed_date", "designation")

val parsed = df
  .withColumn("start_date", to_date($"start_date", "dd/MM/yyyy"))        
  .withColumn("changed_date", to_date($"changed_date", "dd/MM/yyyy"))

Find difference

 val diff = parsed
   .withColumn("diff", datediff($"start_date", $"changed_date"))
   .where($"diff" > 0)

Apply solution of your choice from How to select the first row of each group?, for example window functions. If you group by id:

import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy($"id").orderBy($"diff")

diff.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").show
// +---+----------+----------+------------+-----------+----+
// | id|project_id|start_date|changed_date|designation|diff|
// +---+----------+----------+------------+-----------+----+
// |  1|        P1|2018-10-08|  2017-09-01|          2| 402|
// +---+----------+----------+------------+-----------+----+

Reference:

Alper t. Turker
  • 34,230
  • 9
  • 83
  • 115