I'm currently running some experiments with Spark to better understand what performance I can expect when working with highly cascaded queries.
It seems to me that invoking persist() (or cache()) on intermediate results causes my execution time to grow exponentially.
Consider this minimal example:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName(getClass().getName())
        .master("local[*]")
        .getOrCreate();

// start from a single-row Dataset and keep re-adding the same column,
// so the logical plan grows by one step per iteration
Dataset<Row> ds = spark.range(1).toDF();
for (int i = 1; i < 200; i++) {
    ds = ds.withColumn("id", ds.col("id"));
    ds.cache();
    long t0 = System.currentTimeMillis();
    long cnt = ds.count();
    long t1 = System.currentTimeMillis();
    System.out.println("Iteration " + String.format("%3d", i) + " count: " + cnt + " time: " + (t1 - t0) + "ms");
}
Without ds.cache() in the code, the time for count() stays roughly constant. With ds.cache(), however, the execution time starts to increase exponentially:
iteration   without cache() [ms]   with cache() [ms]
      ...                    ...                 ...
       24                     61                 297
       25                     74                 515
       26                     86               1,036
       27                     78               1,904
       28                     73               3,233
       29                     79               6,815
       30                     75              12,549
       31                    107              26,379
       32                     69              46,207
       33                     54             102,172
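In case it is the query planning rather than the execution itself that blows up, a diagnostic variant of the loop could print the extended plan on every iteration. This is only a sketch: probe is a hypothetical name, and it reuses the spark session from the example above with a shorter loop bound.

// Diagnostic sketch: explain(true) prints the parsed, analyzed, optimized
// and physical plans, which should show whether the plan itself grows
// once cache() is called inside the loop.
Dataset<Row> probe = spark.range(1).toDF();
for (int i = 1; i < 10; i++) {
    probe = probe.withColumn("id", probe.col("id"));
    probe.cache();
    probe.explain(true); // extended plan output for this iteration
    probe.count();
}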
Any idea what is going on here? From what I understand about how .persist() works, this doesn't make much sense.
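For reference, my understanding is that on a Dataset, cache() is just shorthand for persist() with the default storage level (MEMORY_AND_DISK), so I would expect the explicit variant sketched below to show exactly the same behaviour:

import org.apache.spark.storage.StorageLevel;

// Explicit form of what cache() does on a Dataset, as far as I understand it;
// I would expect this to behave the same as ds.cache() in the loop above.
ds.persist(StorageLevel.MEMORY_AND_DISK());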
Thanks,
Thorsten