I have a customer spending history table, spent_df, shown below. (The actual table is much larger; the following only shows its structure.)
id   month   spent
A    1       5
B    3       15
C    1       12
A    2       99
C    3       35
B    1       10
C    2       20
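For a reproducible example, the sample above can be constructed like this (a sketch; the column types are my assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# same rows as the sample table above
spent_df = spark.createDataFrame(
    [("A", 1, 5), ("B", 3, 15), ("C", 1, 12),
     ("A", 2, 99), ("C", 3, 35), ("B", 1, 10), ("C", 2, 20)],
    ["id", "month", "spent"],
)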
This table is a PySpark DataFrame, and I want to convert it to a table in which: 1) each customer corresponds to one row; 2) each month's spend corresponds to one column. The target structure is shown below:
id\month   1      2      3
A          5      99     null
B          10     null   15
C          12     20     35
As the table is huge, I want to use the PySpark library to implement this transformation. The code I wrote is as follows:
import numpy as np
from pyspark.sql.functions import col

spent_final = spent_df.select("id").distinct()
for i in np.arange(1, 4):
    i_str = str(i)
    # keep only month i's rows, renaming "spent" to the month number
    spent_tmp = spent_df.select("id", col("spent").alias(i_str)).where(col("month") == i)
    # left join so customers with no spend in month i get null
    spent_final = spent_final.alias('a').join(spent_tmp.alias('b'), col('a.id') == col('b.id'), 'left_outer').select(col('a.*'), col('b.' + i_str))
I am very new to PySpark, but the code seems to do the job. However, I am wondering: is there a better way to do this? Can I do it without a loop?
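Specifically, would something like groupBy().pivot() be the right approach here? A minimal sketch of what I have in mind (assuming each (id, month) pair appears at most once, as in the sample, so first() picks the single value):

from pyspark.sql import functions as F

# pivot "month" into columns; passing the value list up front avoids
# an extra pass over the data to discover the distinct months
spent_wide = (
    spent_df.groupBy("id")
            .pivot("month", [1, 2, 3])
            .agg(F.first("spent"))
)
spent_wide.show()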