Suppose we start from some source data and, through a chain of transformations, arrive at an intermediate result `df_intermediate`. Since all the transformations along the pipeline are lazy, nothing has actually been computed yet.
Now I would like to apply two different operations to `df_intermediate`. For example, I would like to calculate `df_intermediate.agg({"col": "max"})` and `df_intermediate.approxQuantile("col", [0.1, 0.2, 0.3], 0.01)` as two separate commands.
My question: when Spark executes the second of these operations, does it need to recompute `df_intermediate`? In other words, do both computations start again from the raw data, or does Spark keep the intermediate result around? Obviously I can cache the intermediate result explicitly, but I'm wondering whether Spark does this kind of optimization internally.