You can use groupBy.pivot and then aggregate the sale_amount column. In this case you can take the first value for each combination of item_id and week_id, assuming there is no more than one row per combination:
df.groupBy("item_id").pivot("week_id").agg(first("sale_amount")).show
+-------+---+---+---+
|item_id|  1|  2|  3|
+-------+---+---+---+
|      1| 10| 12| 15|
|      2|  4|  7|  9|
+-------+---+---+---+
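For reference, the snippets here assume a df with item_id, week_id and sale_amount columns. A minimal sketch that reproduces the values shown above (the Int column types and the SparkSession name spark are assumptions):

import spark.implicits._  // assumes a SparkSession named spark

val df = Seq(
  (1, 1, 10), (1, 2, 12), (1, 3, 15),
  (2, 1, 4),  (2, 2, 7),  (2, 3, 9)
).toDF("item_id", "week_id", "sale_amount")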
If there can be more than one row per combination of item_id and week_id, you can use another aggregation function, sum for instance:
df.groupBy("item_id").pivot("week_id").agg(sum("sale_amount")).show
+-------+---+---+---+
|item_id|  1|  2|  3|
+-------+---+---+---+
|      1| 10| 12| 15|
|      2|  4|  7|  9|
+-------+---+---+---+
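The result is identical here only because the sample data has a single row per item_id/week_id pair. As a rough sketch of the difference (dfDup and its extra row are made up purely for illustration), a duplicate row gets added into the cell by sum:

// hypothetical duplicate row for item 1, week 1
val dfDup = df.union(Seq((1, 1, 5)).toDF("item_id", "week_id", "sale_amount"))

dfDup.groupBy("item_id").pivot("week_id").agg(sum("sale_amount")).show
// the cell for item_id 1, week 1 becomes 15 (10 + 5) instead of 10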
To get proper column names, you can transform the week_id column before pivoting:
import org.apache.spark.sql.functions._

(df.withColumn("week_id", concat(lit("week_"), df("week_id")))
  .groupBy("item_id").pivot("week_id")
  .agg(first("sale_amount"))
  .show)
+-------+------+------+------+
|item_id|week_1|week_2|week_3|
+-------+------+------+------+
|      1|    10|    12|    15|
|      2|     4|     7|     9|
+-------+------+------+------+
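As a side note, pivot also has an overload that takes the pivot values explicitly, pivot(pivotColumn, values). Sketching it on the same df, this fixes the column order and avoids the extra pass Spark otherwise needs to collect the distinct week_ids:

df.withColumn("week_id", concat(lit("week_"), df("week_id")))
  .groupBy("item_id")
  .pivot("week_id", Seq("week_1", "week_2", "week_3"))
  .agg(first("sale_amount"))
  .show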