Using PySpark with Delta Lake on Databricks, I have the following scenario:
sdf = spark.read.format("delta").table("...")
result = sdf.filter(...).groupBy(...).agg(...)
analysis_1 = result.groupBy(...).count() # expensive plan recomputed when this is evaluated
analysis_2 = result.groupBy(...).count() # expensive plan recomputed again when this is evaluated
As I understand Spark (with or without Delta Lake), evaluation is lazy: result
is not actually computed when it is declared, but only once an action is triggered on something derived from it.
However, in this example result feeds two separate analyses, so the most expensive transformation is computed twice, once per action.
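To make this concrete, my understanding is that explain() on each derived DataFrame would show the full lineage planned independently; a small sketch reusing the names above:

analysis_1.explain() # plan should include the whole read/filter/groupBy/agg lineage
analysis_2.explain() # the same expensive stages should appear again in this plan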
Is it possible to force execution at some point in the code, e.g.
sdf = spark.read.format("delta").table("...")
result = sdf.filter(...).groupBy(...).agg(...)
result.force() # hypothetical call: expensive transformation performed once, here??
analysis_1 = result.groupBy(...).count() # quick smaller transformation??
analysis_2 = result.groupBy(...).count() # quick smaller transformation??
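A minimal sketch of the kind of workaround I have in mind, assuming DataFrame.cache() (or persist()) plus an eager action such as count() can be used to materialize result up front:

sdf = spark.read.format("delta").table("...")
result = sdf.filter(...).groupBy(...).agg(...)
result.cache() # mark result for caching; on its own this is still lazy
result.count() # action: forces computation and populates the cache
analysis_1 = result.groupBy(...).count() # should now build on the cached result
analysis_2 = result.groupBy(...).count() # should now build on the cached result
result.unpersist() # release the cached data when finished

Is cache()/persist() plus a throwaway action the idiomatic way to do this, or is there a more direct way to force execution?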