I have a customer spending history table, spent_df, shown below. (The actual table is much larger; the following only shows its structure.)
id   month   spent
A    1       5
B    3       15
C    1       12
A    2       99
C    3       35
B    1       10
C    2       20
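For a reproducible example, the sample above can be constructed like this (a sketch; the column types are my assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# same rows as the sample table above
spent_df = spark.createDataFrame(
    [("A", 1, 5), ("B", 3, 15), ("C", 1, 12),
     ("A", 2, 99), ("C", 3, 35), ("B", 1, 10), ("C", 2, 20)],
    ["id", "month", "spent"],
)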
This table is a PySpark DataFrame, and I want to convert it to a table in which: 1) each customer corresponds to one row; 2) each month's spend corresponds to one column. The target structure is shown below:
id\month   1      2      3
A          5      99     null
B          10     null   15
C          12     20     35
As the table is huge, I want to use the PySpark library to implement this transformation. The code I wrote is as follows:
import numpy as np
from pyspark.sql.functions import col

spent_final = spent_df.select("id").distinct()
for i in np.arange(1, 4):
    i_str = str(i)
    # keep only month i's rows, renaming "spent" to the month number
    spent_tmp = spent_df.select("id", col("spent").alias(i_str)).where(col("month") == i)
    # left join so customers with no spend in month i get null
    spent_final = spent_final.alias('a').join(spent_tmp.alias('b'), col('a.id') == col('b.id'), 'left_outer').select(col('a.*'), col('b.' + i_str))
I am very new to PySpark, but the code seems to do the job. However, I am wondering: is there a better way to do this? Can I do it without a loop?
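Specifically, would something like groupBy().pivot() be the right approach here? A minimal sketch of what I have in mind (assuming each (id, month) pair appears at most once, as in the sample, so first() picks the single value):

from pyspark.sql import functions as F

# pivot "month" into columns; passing the value list up front avoids
# an extra pass over the data to discover the distinct months
spent_wide = (
    spent_df.groupBy("id")
            .pivot("month", [1, 2, 3])
            .agg(F.first("spent"))
)
spent_wide.show()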