
I have the following DataFrame df in Spark:

+------------+---------+-----------+
|OrderID     |     Type|        Qty|
+------------+---------+-----------+
|      571936|    62800|          1|
|      571936|    62800|          1|
|      571936|    62802|          3|
|      661455|    72800|          1|
|      661455|    72801|          1|
+------------+---------+-----------+

I need to select the row with the largest Qty for each unique OrderID, or the last row per OrderID if all Qty values are the same (e.g. as for 661455). The expected result:

+------------+---------+-----------+
|OrderID     |     Type|        Qty|
+------------+---------+-----------+
|      571936|    62802|          3|
|      661455|    72801|          1|
+------------+---------+-----------+

Any ideas how to get it?

This is what I tried:

val partitionWindow = Window.partitionBy(col("OrderID")).orderBy(col("Qty").asc)
val result = df.over(partitionWindow)
Fluxy

1 Answer

scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._

scala> val w = Window.partitionBy("OrderID").orderBy("Qty")
scala> val w1 = Window.partitionBy("OrderID")

scala> df.show()
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62800|  1|
| 571936|62800|  1|
| 571936|62802|  3|
| 661455|72800|  1|
| 661455|72801|  1|
+-------+-----+---+
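
row_number ranks the rows of each OrderID by Qty ascending, and max("rn") over the unordered window gives the highest rank in that group. Keeping only the row whose rank equals that maximum therefore returns the largest-Qty row, or the last row of the group when every Qty ties: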


scala> df.withColumn("rn",  row_number.over(w)).withColumn("mxrn", max("rn").over(w1)).filter($"mxrn" === $"rn").drop("mxrn","rn").show
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 661455|72801|  1|
| 571936|62802|  3|
+-------+-----+---+
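
For completeness, here is a minimal sketch of the same two-window approach as a single block you could paste into spark-shell (e.g. via :paste). The inline Seq just rebuilds the question's sample data, and the integer column types are an assumption:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Sample data reconstructed from the question (column types assumed to be Int).
val df = Seq(
  (571936, 62800, 1),
  (571936, 62800, 1),
  (571936, 62802, 3),
  (661455, 72800, 1),
  (661455, 72801, 1)
).toDF("OrderID", "Type", "Qty")

// Rank rows per OrderID by Qty ascending; the highest rank marks the wanted row.
val w  = Window.partitionBy("OrderID").orderBy("Qty")
val w1 = Window.partitionBy("OrderID")

val result = df
  .withColumn("rn", row_number().over(w))      // rank within each OrderID by Qty
  .withColumn("mxrn", max("rn").over(w1))      // highest rank in the OrderID group
  .filter($"rn" === $"mxrn")                   // keep only the last-ranked row
  .drop("rn", "mxrn")

result.show()

One caveat: when several rows tie on Qty, Spark does not guarantee which of them row_number labels last. If "last" must follow a specific order, add the relevant column to the orderBy of w.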
Nikhil Suthar