I am doing multiple joins against the same DataFrame. The DataFrames I am joining with are the results of a group-by on my original DataFrame.
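from pyspark.sql.functions import col, mean

listOfCols = ["a","b","c",....]
for c in listOfCols:
    # mean of the target per distinct value of column c
    means = df.groupby(col(c)).agg(mean(target).alias(f"{c}_mean_encoding"))
    # attach the per-group mean back onto every row of df
    df = df.join(means, c, how="left")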
This code produces more than 100,000 tasks and takes forever to finish. I see a lot of shuffling in the DAG. How can I optimize this code?