I have written a class which performs standard scaling over grouped data.
import pyspark.sql.functions as F
from functools import reduce

class Scaler:
    ...

    def __transformOne__(self, df_with_stats, newName, colName):
        # (value - group mean) / (group stddev + tol), written to a temporary
        # column which then replaces the original column
        return df_with_stats\
            .withColumn(newName,
                        (F.col(colName) - F.col(f'avg({colName})')) / (F.col(f'stddev_samp({colName})') + self.tol))\
            .drop(colName)\
            .withColumnRenamed(newName, colName)

    def transform(self, df):
        df_with_stats = df.join(....)  # calculate stats here by doing a groupby and then a join
        return reduce(lambda df_with_stats, kv: self.__transformOne__(df_with_stats, *kv),
                      self.__tempNames__(), df_with_stats)[df.columns]
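For context, the pieces elided above look roughly like this. It is only a sketch, not the exact code: groupCols, scaleCols and tol stand in for the real constructor arguments, and the stats join is written out the way the comment in transform describes it (a groupby followed by a join), which produces columns named avg(col) and stddev_samp(col) that __transformOne__ reads.

class Scaler:
    def __init__(self, groupCols, scaleCols, tol=1e-8):
        self.groupCols = groupCols   # columns that define the groups
        self.scaleCols = scaleCols   # columns to standard-scale within each group
        self.tol = tol               # guards against division by zero for constant groups

    def __tempNames__(self):
        # (temporary_name, original_name) pairs consumed by __transformOne__
        return [(f'{c}_scaled', c) for c in self.scaleCols]

    # the elided join in transform() is essentially:
    #   stats = df.groupBy(self.groupCols).agg(
    #       *[F.avg(c) for c in self.scaleCols],
    #       *[F.stddev_samp(c) for c in self.scaleCols])
    #   df_with_stats = df.join(stats, on=self.groupCols, how='left')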
The idea is to store the per-group means and standard deviations in columns and then just do a column subtraction/division on the column I want to scale. This part is done in __transformOne__, so basically it's an arithmetic operation on one column.
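As a tiny standalone illustration of that arithmetic for a single column (the toy data, the group column g and the value column x are made up for this example):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 1.0), ('a', 3.0), ('b', 10.0), ('b', 14.0)], ['g', 'x'])

# per-group stats, joined back so every row carries its group's avg(x) and stddev_samp(x)
stats = df.groupBy('g').agg(F.avg('x'), F.stddev_samp('x'))
scaled = (df.join(stats, on='g', how='left')
            .withColumn('x_scaled',
                        (F.col('x') - F.col('avg(x)')) /
                        (F.col('stddev_samp(x)') + F.lit(1e-8))))
scaled.show()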
If I want to scale multiple columns, I just call __transformOne__ once per column, a bit more concisely by chaining the calls with functools.reduce (see the transform method above).
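To spell out what that reduce does: it is just a left-to-right chain of __transformOne__ calls, equivalent to this loop over the (temporary_name, original_name) pairs:

# inside transform(), the reduce is equivalent to:
out = df_with_stats
for newName, colName in self.__tempNames__():
    out = self.__transformOne__(out, newName, colName)
result = out[df.columns]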
The class works fast enough for a single column, but when I have multiple columns it takes too much time. I have no idea about the internals of Spark, so I'm a complete newbie. Is there a way I can improve this computation over multiple columns?