I am starting to use pyspark, first on a single machine, and have a question about cache().
In a Jupyter notebook, I first start my session:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
spark = SparkSession.builder.appName('Test').getOrCreate()
Say I want to just count the number of rows in my table T1. So I did the following:
T1.createOrReplaceTempView("T1_sdf")
S1 = spark.sql("select count(*) as total from T1_sdf")
First, I found that just by executing these two commands, the notebook (or Spark?) does not seem to do any real computation yet; only after I run
S1.show()
does it actually start counting the number of rows in T1.
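If I understand the lazy part correctly, something like this should show the same thing (just a sketch, with T1 being the DataFrame I already loaded):

S1.explain()    # just prints the query plan; as far as I can tell, no Spark job runs here
S1.collect()    # but an action like show(), count() or collect() triggers the real work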
Later on, I have to use the table T1 repeatedly to perform some other operations. I have always heard people mention cache, but first I am not sure how to actually use cache in this SQL context.
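My best guess is that it is something like one of the following (just a sketch of what I mean; I am not sure which of these, if any, is the proper way):

# Guess 1: cache the DataFrame itself
T1.cache()                          # as I understand it, this only marks T1; data is stored on the first action

# Guess 2: cache the registered temp view by name
spark.catalog.cacheTable("T1_sdf")

# Guess 3: do it directly in SQL
spark.sql("CACHE TABLE T1_sdf")     # I read this one may run eagerly, unlike cache()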
In addition to that, I am wondering why the code after cache() is faster, and exactly which part of the execution gets sped up. I had some experience with dynamic programming before, where I had to write a recursion. There, I understood very clearly why caching makes a great difference: when computing something new, I need the values of some intermediate steps that I have already computed, so I store those intermediate values and fetch them directly when I need them again, instead of recomputing them.

But here, in the context of caching T1, I am not sure why cache helps to speed things up. I guess I don't understand how Spark executes the Python or SQL code under the hood.
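To make the question concrete, here is roughly the repeated use I have in mind (just a sketch; some_col is a made-up column name), and what I would like to understand is exactly which of these steps cache() saves:

T1.cache()                                    # lazy: only marks T1 for caching
T1.count()                                    # first action: computes T1 and (I assume) fills the cache

# later, repeated uses of the same table
spark.sql("select count(*) as total from T1_sdf").show()
T1.groupBy("some_col").count().show()
T1.filter(F.col("some_col").isNotNull()).count()

# My question: without cache(), do all of these re-read / recompute T1 from scratch,
# and with cache(), do they just read the already materialized rows from memory?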