I have a data frame that looks like:

Region, 2000Q1, 2000Q2, 2000Q3, ...
A, 1,2,3,...

I want to transpose this wide table to a long table by 'Region'. So the final product will look like:

Region, Time, Value
A, 2000Q1, 1
A, 2000Q2, 2
A, 2000Q3, 3
A, 2000Q4, 4
....

The original table has a very wide array of columns, but the aggregation level is always Region, and all remaining columns are to be transposed.

Do you know an easy way or function to do this?

Yi Du
  • what you are trying to do is a melt. You should try something like in [that](https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe). – Seb Aug 20 '20 at 19:10
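For reference, the melt described in that link boils down to SQL's stack; a minimal sketch, assuming the sample columns from the question (backticks are needed because the names start with a digit, and all value columns must share a type):

# melt via stack(): 3 name/value pairs collapse into (Time, Value) rows
df.selectExpr(
    "Region",
    "stack(3, '2000Q1', `2000Q1`, '2000Q2', `2000Q2`, '2000Q3', `2000Q3`) as (Time, Value)"
).show()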

2 Answers

Try the arrays_zip function, then explode the resulting array.

Example:

df = spark.createDataFrame([('A',1,2,3)], ['Region','2000q1','2000q2','2000q3'])

from pyspark.sql.functions import *

# every column except Region, plus a pipe-delimited string of those names
cols = [c for c in df.columns if c != 'Region']
col_name = "|".join(cols)

# zip the values with the column names, then explode one row per pair
df.withColumn("cc",explode(arrays_zip(array(*cols),split(lit(col_name),"\\|")))).\
select("Region","cc.*").\
toDF(*['Region','Value','Time']).\
show()
#+------+-----+------+
#|Region|Value|  Time|
#+------+-----+------+
#|     A|    1|2000q1|
#|     A|    2|2000q2|
#|     A|    3|2000q3|
#+------+-----+------+
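As an aside, Spark 3.4+ ships this reshape as a built-in: DataFrame.unpivot (melt is an alias). A minimal sketch against the same sample frame:

# built-in unpivot, Spark 3.4+ only; melt() is an alias
cols = [c for c in df.columns if c != 'Region']
df.unpivot('Region', cols, 'Time', 'Value').show()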
notNull

A similar approach, but with an improved column calculation.

# all columns except Region get unpivoted
cols = df.columns
cols.remove('Region')

import pyspark.sql.functions as f

# zip the literal column names with the column values, then explode
# (the output below assumes a sample frame with columns 2000Q1 .. 2000Q5)
df.withColumn('array', f.explode(f.arrays_zip(f.array(*map(lambda x: f.lit(x), cols)), f.array(*cols)))) \
  .select('Region', 'array.*') \
  .toDF('Region', 'Time', 'Value') \
  .show(30, False)

+------+------+-----+
|Region|Time  |Value|
+------+------+-----+
|A     |2000Q1|1    |
|A     |2000Q2|2    |
|A     |2000Q3|3    |
|A     |2000Q4|4    |
|A     |2000Q5|5    |
+------+------+-----+

p.s. Don't accept this as an answer :)

Lamanus