I am working in Apache Spark, applying multiple transformations to a single DataFrame with Python.
I have written some functions to make the different transformations easier. Imagine we have functions like:
def clearAccents(df, columns):
    # lines that remove accents from the DataFrame
    # using Spark functions or a UDF
    return df
I use those functions to "overwrite" the DataFrame variable, saving the newly transformed DataFrame each time a function returns. I know this is not good practice, and now I am seeing the consequences.
I noticed that every time I add a line like the ones below, the running time gets longer:
# Step transformation 1:
df = function1(df, column)
# Step transformation 2:
df = function2(df, column)
As I understand it, Spark does not save the resulting DataFrame; instead, it saves all the operations needed to produce the DataFrame at the current line. For example, when running function1, Spark runs only that function, but when running function2, Spark runs function1 and then function2. What if I really need to run each function only once?
I tried df.cache() and df.persist(), but I didn't get the desired results.
I want to save partial results so that it is not necessary to compute every instruction from the beginning, only from the last transformation function, without getting a StackOverflowError.