
I have the PySpark dataframe below:

+-----+----+----+----+----+
|Month|Col1|Col2|Col3|Col4|
+-----+----+----+----+----+
|    1|  10|  20|  30|  40|
|    2|  20|  30|  50|  60|
|    3|  30|  50|  60|  70|
|    4|  30|  40|  50|  80|
+-----+----+----+----+----+

I need to convert the dataframe in the following manner: unpivot each column into its own row and pair it with the corresponding month.

The desired output has one row per (Month, original column) pair, with columns Month, Col_name, and Value, as shown in the first answer's output below.

I have tried the following approach, df.groupBy("Month").pivot("col1"), but I am not getting the desired output.

  • Does this answer your question? [How to melt Spark DataFrame?](https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe) – samkart Mar 15 '23 at 06:20

2 Answers


You can use pandas_api() to enable the pandas API on a Spark DataFrame (not to be confused with to_pandas()):

# melt via the pandas-on-Spark API, then convert back to a Spark DataFrame
sdf = (sdf.pandas_api()
          .melt('Month', var_name='Col_name', value_name='Value')
          .to_spark())

Output:

>>> sdf.show()
+-----+--------+-----+
|Month|Col_name|Value|
+-----+--------+-----+
|    1|    Col1|   10|
|    1|    Col2|   20|
|    1|    Col3|   30|
|    1|    Col4|   40|
|    2|    Col1|   20|
|    2|    Col2|   30|
|    2|    Col3|   50|
|    2|    Col4|   60|
|    3|    Col1|   30|
|    3|    Col2|   50|
|    3|    Col3|   60|
|    3|    Col4|   70|
|    4|    Col1|   30|
|    4|    Col2|   40|
|    4|    Col3|   50|
|    4|    Col4|   80|
+-----+--------+-----+
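
On Spark 3.4+ the same melt is also available natively as DataFrame.unpivot, which avoids the round trip through the pandas-on-Spark API. A minimal sketch, assuming sdf has the Month/Col1..Col4 schema from the question:

# native unpivot (Spark 3.4+); produces the same output as the melt above
sdf2 = sdf.unpivot(
    ids=["Month"],                             # identifier column(s) to keep
    values=["Col1", "Col2", "Col3", "Col4"],   # columns to turn into rows
    variableColumnName="Col_name",
    valueColumnName="Value",
)
sdf2.show()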
– Corralien

You can use the PySpark code below to transform the columns into rows; the stack expression unpivots all four value columns in a single select:

df = spark.createDataFrame([(1, 10, 20, 30, 40),
                            (2, 20, 30, 50, 60),
                            (3, 30, 50, 60, 70),
                            (4, 30, 40, 50, 80)],
                           ["Month", "Col1", "Col2", "Col3", "Col4"])

# stack(n, label1, expr1, ..., labeln, exprn) emits n rows per input row,
# turning each (label, value) pair into a (Col_name, Value) row
melted_df = df.selectExpr(
    "Month",
    "stack(4, 'Col1', Col1, 'Col2', Col2, 'Col3', Col3, 'Col4', Col4) as (Col_name, Value)")

melted_df.show()
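
If the value columns are not known in advance, the stack() arguments can be built from the schema instead of hard-coded. A short sketch, assuming every column except Month should be unpivoted:

value_cols = [c for c in df.columns if c != "Month"]
stack_args = ", ".join(f"'{c}', {c}" for c in value_cols)

# same stack-based melt, with the argument list generated dynamically
melted_df = df.selectExpr(
    "Month",
    f"stack({len(value_cols)}, {stack_args}) as (Col_name, Value)")

melted_df.show()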