My question has two parts. The first is about understanding how Spark works, and the second is about optimization.
I have a Spark dataframe with multiple categorical variables. For each of these categorical variables, I am adding a new column in which each row holds the frequency of the corresponding level.
For example:
Date_Built   Square_Footage   Num_Beds   Num_Baths   State   Price    Freq_State
01/01/1920   1700             3          2           NY      700000   4500
Here, for State (a categorical variable), I am adding a new column Freq_State. The level NY appears 4500 times in the dataset, so this row gets 4500 in the Freq_State column.
I have multiple such categorical columns, and for each of them I am adding a column with the frequency of its levels.
This is the code I am using to achieve this:
def calculate_freq(df, categorical_cols):
    for each_cat_col in categorical_cols:
        # Count how often each level of the categorical column occurs,
        # renamed to match the Freq_<column> naming described above
        _freq = (df.groupBy(each_cat_col).count()
                   .withColumnRenamed("count", "Freq_" + each_cat_col))
        # Attach the frequency to every row by joining back on the categorical column
        df = df.join(_freq, each_cat_col, "inner")
    return df
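For reference, I call it roughly like this (the column list here is just an illustration of my actual categorical columns):

    categorical_cols = ["State"]  # plus my other categorical columns
    df = calculate_freq(df, categorical_cols)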
Part 1
Here, as you can see, I am updating the dataframe inside the for loop. Is this way of updating a dataframe advisable when running the code on a cluster? I wouldn't have been concerned about this with a pandas dataframe, but I am not certain once the context changes to Spark.
Also, would it make a difference if I were simply running the above process in a loop rather than inside a function?
Part 2
Is there a more optimized way to do this? At the moment I am performing a join on every iteration of the loop. Can this be avoided?
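For example, one alternative I have been considering is computing the counts with a window function instead of a separate aggregation plus join. This is only a rough sketch (the name calculate_freq_window is mine, and I am assuming pyspark.sql.functions and Window here), and I don't know whether it is actually faster on a cluster:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    def calculate_freq_window(df, categorical_cols):
        for each_cat_col in categorical_cols:
            # Count rows per level with a window partitioned by the categorical column,
            # so no separate aggregate dataframe and no join are needed
            w = Window.partitionBy(each_cat_col)
            df = df.withColumn("Freq_" + each_cat_col, F.count(F.lit(1)).over(w))
        return df

Would something like this be preferable to the join-per-column approach, or does it just move the shuffle around?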