I would create a column to rank your "rate" column and then pivot.
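For reference, here is a minimal sketch of a DataFrame matching the examples below (the SparkSession setup and the sample values are assumptions for illustration):
from pyspark.sql import SparkSession

# Hypothetical setup reproducing the example data shown below
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 0.1), ("A", 0.2), ("B", 0.1), ("B", 0.3), ("C", 0.1), ("C", 0.2)],
    ["group", "rate"],
)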
First create a "rank"
column and concatenate the string "rate_"
to the row_number
:
from pyspark.sql.functions import concat, first, lit, row_number
from pyspark.sql import Window
# Rank each group's rates and prefix with "rate_" to build the pivot labels
df = df.withColumn(
    "rank",
    concat(
        lit("rate_"),
        row_number().over(Window.partitionBy("group").orderBy("rate")).cast("string")
    )
)
df.show()
#+-----+----+------+
#|group|rate| rank|
#+-----+----+------+
#| B| 0.1|rate_1|
#| B| 0.3|rate_2|
#| C| 0.1|rate_1|
#| C| 0.2|rate_2|
#| A| 0.1|rate_1|
#| A| 0.2|rate_2|
#+-----+----+------+
Now group by the "group"
column and pivot
on the "rank"
column. Since you need an aggregation, use first
.
df.groupBy("group").pivot("rank").agg(first("rate")).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.1| 0.3|
#| C| 0.1| 0.2|
#| A| 0.1| 0.2|
#+-----+------+------+
The above does not depend on knowing the number of records in each group ahead of time. However, if (as you said) you know the number of records in each group, you can make the pivot more efficient by passing in the values argument:
num_records = 2
values = ["rate_" + str(i+1) for i in range(num_records)]
df.groupBy("group").pivot("rank", values=values).agg(first("rate")).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.1| 0.3|
#| C| 0.1| 0.2|
#| A| 0.1| 0.2|
#+-----+------+------+