Aggregation using Dataframe

Question

I am new to Spark and need help implementing some logic with the DataFrame API. Assume I have a DataFrame df1 with the following data.

DF1 :

txn-id,productid,desc
1,'AA','ADESC'
2,'BB','BDESC'
3,'CC','CDESC'
4,'BB','ZDESC'
5,'CC','YDESC'

I want the output in the format below, using the DataFrame API (without Spark SQL). Basically, I want to group by productid and, for each group, select the maximum txn-id together with the desc of that transaction.

Result:

txn-id,productid,desc
1,'AA','ADESC'
4,'BB','ZDESC'
5,'CC','YDESC'

Can you please help me with the logic?

Thanks, Sumit

score 0 · Answer 1 · answered Jul 05 '18 at 13:04

Use a window partitioned by the productid column and ordered by txn-id descending, then keep the first row of each partition.

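import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows within each productid by txn-id, highest first.
val byProduct = Window.partitionBy("productid").orderBy(col("txn-id").desc)

// Keep only the top-ranked row per productid, then drop the helper column.
df1.withColumn("rnum", row_number().over(byProduct))
  .filter(col("rnum") === 1)
  .drop("rnum")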
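Since the question asks for a group-by, here is an alternative sketch that is not part of the answer above: aggregate with max over a struct of (txn-id, desc). Structs are compared field by field from left to right, so the maximum struct in each group is the one with the highest txn-id, and it carries that row's desc along with it. The local SparkSession, the app name, and the alias "latest" are assumptions made for illustration; df1 is rebuilt from the sample data in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, struct}

// Assumption: a local session for illustration; in a real job reuse the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("max-txn-per-product") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df1 = Seq(
  (1, "AA", "ADESC"),
  (2, "BB", "BDESC"),
  (3, "CC", "CDESC"),
  (4, "BB", "ZDESC"),
  (5, "CC", "YDESC")
).toDF("txn-id", "productid", "desc")

// max(struct(txn-id, desc)) orders by txn-id first, so each group keeps
// the struct belonging to its highest txn-id.
val result = df1
  .groupBy("productid")
  .agg(max(struct(col("txn-id"), col("desc"))).as("latest")) // "latest" is a hypothetical alias
  .select(
    col("latest").getField("txn-id").as("txn-id"),
    col("productid"),
    col("latest").getField("desc").as("desc")
  )
  .orderBy("productid")

result.show()
```

Compared with the window approach, this performs a single aggregation and never materializes a row number, but every extra column you want to carry along has to be added to the struct. If txn-id is not unique within a productid, ties are broken by the next struct field (desc here), whereas row_number picks one of the tied rows arbitrarily.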
