I am working in Apache Spark, applying multiple transformations to a single DataFrame with Python.
I have written some functions to make the different transformations easier. Imagine we have functions like:
def clearAccents(df, columns):
    # lines that remove accents from the DataFrame
    # using Spark functions or a UDF
    return df
I use those functions to "overwrite" the DataFrame variable, saving the newly transformed DataFrame each time a function returns. I know this is not good practice, and now I am seeing the consequences.
I noticed that every time I add a line like the ones below, the running time gets longer:
# Step transformation 1:
df = function1(df, column)
# Step transformation 2:
df = function2(df, column)
As I understand it, Spark does not save the resulting DataFrame; instead, it saves all the operations needed to produce the DataFrame at the current line. For example, when running function1, Spark runs only that function, but when running function2, Spark runs function1 and then function2. What if I really need to run each function only once?
I tried df.cache() and df.persist(), but I didn't get the desired results.
I want to save partial results so that it is not necessary to compute every instruction from the beginning, only from the last transformation function, without getting a StackOverflowError.