
I have a data frame that looks like this:

item_id  week_id  sale_amount
1        1        10
1        2        12
1        3        15
2        1        4
2        2        7
2        3        9

I want to transform it into a new data frame that looks like this:

item_id  week_1  week_2  week_3
1        10      12      15
2        4       7       9

This can be done easily in R, but I don't know how to do it with the Spark API in Scala.

lserlohn

1 Answer


You can use `groupBy.pivot` and then aggregate the `sale_amount` column. If there is at most one row per combination of `item_id` and `week_id`, you can simply take the first value from each group:

import org.apache.spark.sql.functions.first

df.groupBy("item_id").pivot("week_id").agg(first("sale_amount")).show
+-------+---+---+---+
|item_id|  1|  2|  3|
+-------+---+---+---+
|      1| 10| 12| 15|
|      2|  4|  7|  9|
+-------+---+---+---+
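As a side note, when `pivot` is called without an explicit list of values, Spark first scans the data to collect the distinct `week_id` values. If you know the week ids ahead of time, you can pass them to `pivot` directly, which skips that extra pass and fixes the output schema; a minimal sketch, assuming the weeks are 1 through 3:

```scala
import org.apache.spark.sql.functions.first

// Passing the distinct values explicitly avoids a separate scan to
// discover them and guarantees the order of the pivoted columns.
df.groupBy("item_id")
  .pivot("week_id", Seq(1, 2, 3))
  .agg(first("sale_amount"))
  .show
```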

If there can be more than one row per combination of `item_id` and `week_id`, use a different aggregation function, `sum` for instance:

import org.apache.spark.sql.functions.sum

df.groupBy("item_id").pivot("week_id").agg(sum("sale_amount")).show
+-------+---+---+---+
|item_id|  1|  2|  3|
+-------+---+---+---+
|      1| 10| 12| 15|
|      2|  4|  7|  9|
+-------+---+---+---+

To get proper column names, you can transform the week_id column before pivoting:

import org.apache.spark.sql.functions._

(df.withColumn("week_id", concat(lit("week_"), df("week_id"))).
    groupBy("item_id").pivot("week_id").agg(first("sale_amount")).show)

+-------+------+------+------+
|item_id|week_1|week_2|week_3|
+-------+------+------+------+
|      1|    10|    12|    15|
|      2|     4|     7|     9|
+-------+------+------+------+
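Alternatively, you can pivot first and rename the generated columns afterwards with `toDF`; a sketch, where `pivoted` and `renamed` are just local names for illustration:

```scala
import org.apache.spark.sql.functions.first

val pivoted = df.groupBy("item_id").pivot("week_id").agg(first("sale_amount"))

// Prefix every column except the grouping key with "week_".
val renamed = pivoted.toDF(
  pivoted.columns.map(c => if (c == "item_id") c else s"week_$c"): _*
)
renamed.show
```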
Psidom
  • Thanks, how do I auto-fill 0 when some item_id has no value for a given week_id? – lserlohn Jan 31 '17 at 22:28
  • You can use `na.fill(0)` to fill missing values with 0: `df.withColumn("week_id", concat(lit("week_"), df("week_id"))).groupBy("item_id").pivot("week_id").agg(first("sale_amount")).na.fill(0).show` – Psidom Jan 31 '17 at 22:31