I have written a class which performs standard scaling over grouped data.
import pyspark.sql.functions as F
from functools import reduce

class Scaler:
    ...

    def __transformOne__(self, df_with_stats, newName, colName):
        # (value - group mean) / (group stddev + tol), written to a temporary
        # column which then replaces the original column
        return df_with_stats\
            .withColumn(newName,
                        (F.col(colName) - F.col(f'avg({colName})')) / (F.col(f'stddev_samp({colName})') + self.tol))\
            .drop(colName)\
            .withColumnRenamed(newName, colName)

    def transform(self, df):
        df_with_stats = df.join(....)  # calculate stats here by doing a groupby and then a join
        return reduce(lambda df_with_stats, kv: self.__transformOne__(df_with_stats, *kv),
                      self.__tempNames__(), df_with_stats)[df.columns]
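For context, the pieces elided above look roughly like this. It is only a sketch, not the exact code: groupCols, scaleCols and tol stand in for the real constructor arguments, and the stats join is written out the way the comment in transform describes it (a groupby followed by a join), which produces columns named avg(col) and stddev_samp(col) that __transformOne__ reads.

class Scaler:
    def __init__(self, groupCols, scaleCols, tol=1e-8):
        self.groupCols = groupCols   # columns that define the groups
        self.scaleCols = scaleCols   # columns to standard-scale within each group
        self.tol = tol               # guards against division by zero for constant groups

    def __tempNames__(self):
        # (temporary_name, original_name) pairs consumed by __transformOne__
        return [(f'{c}_scaled', c) for c in self.scaleCols]

    # the elided join in transform() is essentially:
    #   stats = df.groupBy(self.groupCols).agg(
    #       *[F.avg(c) for c in self.scaleCols],
    #       *[F.stddev_samp(c) for c in self.scaleCols])
    #   df_with_stats = df.join(stats, on=self.groupCols, how='left')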
The idea is to store the per-group means and standard deviations in columns and then just do a column subtraction/division on the column I want to scale. This part is done in __transformOne__, so basically it's an arithmetic operation on one column.
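As a tiny standalone illustration of that arithmetic for a single column (the toy data, the group column g and the value column x are made up for this example):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 1.0), ('a', 3.0), ('b', 10.0), ('b', 14.0)], ['g', 'x'])

# per-group stats, joined back so every row carries its group's avg(x) and stddev_samp(x)
stats = df.groupBy('g').agg(F.avg('x'), F.stddev_samp('x'))
scaled = (df.join(stats, on='g', how='left')
            .withColumn('x_scaled',
                        (F.col('x') - F.col('avg(x)')) /
                        (F.col('stddev_samp(x)') + F.lit(1e-8))))
scaled.show()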
If I want to scale multiple columns, I just call __transformOne__ once per column, a bit more concisely by chaining the calls with functools.reduce (see the transform method above).
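To spell out what that reduce does: it is just a left-to-right chain of __transformOne__ calls, equivalent to this loop over the (temporary_name, original_name) pairs:

# inside transform(), the reduce is equivalent to:
out = df_with_stats
for newName, colName in self.__tempNames__():
    out = self.__transformOne__(out, newName, colName)
result = out[df.columns]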
The class works fast enough for a single column, but when I have multiple columns it takes too much time. I have no idea about the internals of Spark, so I'm a complete newbie. Is there a way I can improve this computation over multiple columns?