0

I have the following DataFrame df:

customer_id   product_id   timestamp   action
111           1            1519030817  add
111           1            1519030917  remove
111           2            1519030819  add
222           2            1519030819  add

I want to group records by customer_id and product_id, and take the last action.

This is what I did:

df.groupBy("customer_id","product_id").orderBy(desc("timestamp"))

But how can I actually take the latest action?

The result should be the following:

customer_id   product_id   timestamp   action
111           1            1519030917  remove
111           2            1519030819  add
222           2            1519030819  add
Markus
  • 3,562
  • 12
  • 48
  • 85

1 Answers1

0

You can use Window function as below

  val w = Window.partitionBy("customer_id", "product_id")
          .orderBy(desc("timestamp"), desc("action"))

  df.withColumn("rn", row_number().over(w))
          .where($"rn" === 1).drop("rn") show (false)

Output:

+-----------+----------+----------+------+
|customer_id|product_id|timestamp |action|
+-----------+----------+----------+------+
|111        |2         |1519030819|add   |
|222        |2         |1519030819|add   |
|111        |1         |1519030917|remove|
+-----------+----------+----------+------+
koiralo
  • 22,594
  • 6
  • 51
  • 72