
I am starting to use PySpark, first on a single machine, and I have a question about cache().

In a Jupyter notebook, I first start my session:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
spark = SparkSession.builder.appName('Test').getOrCreate()

Say I just want to count the number of rows in my table T1. So I do the following:

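# T1 is assumed to be a DataFrame created earlier (not shown here)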
T1.createOrReplaceTempView("T1_sdf")
S1 = spark.sql("select count(*) as total from T1_sdf")

First, I found that just executing these two commands does not seem to trigger any real computation in the notebook (or in Spark?). Only after I do

S1.show()

does it actually start counting the rows in T1. Later on, I have to use the table T1 repeatedly to perform some other operations.
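
Concretely, by "repeatedly" I mean running several separate queries against the same view, something like the sketch below (the column name some_col is made up for illustration):

# Each of these is a separate action, so (I assume) each one
# re-runs the full plan over T1 unless something is cached
spark.sql("select count(*) as total from T1_sdf").show()
spark.sql("select some_col, count(*) as n from T1_sdf group by some_col").show()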

I have always heard people mention cache. First, I am not sure how to actually use cache in this SQL context.
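
For example, are any of the following the right way to do it? (These are calls I have seen mentioned; I am not sure which one applies here.)

# Option 1: cache the DataFrame itself (lazy; materialized by the next action)
T1.cache()
T1.count()  # first action populates the cache

# Option 2: cache the registered temp view through the catalog
spark.catalog.cacheTable("T1_sdf")

# Option 3: the SQL statement form (I believe this one caches eagerly)
spark.sql("CACHE TABLE T1_sdf")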

In addition, I am wondering why code that runs after cache() speeds up, and exactly which part of the execution is sped up. I have some experience with dynamic programming, where I had to write a recursion. There, I understood very clearly why caching makes a great difference: when computing something new, I needed the values of intermediate steps I had already computed, so I stored those values and fetched them directly instead of recomputing them.

But here, in the context of caching T1, I am not sure why cache helps to speed things up. I guess I don't understand how Spark executes the Python or SQL code under the hood.
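
If it helps to clarify the question, here is a sketch of the experiment I have in mind (timing the same query before and after caching; my assumption about what happens is in the comments):

import time

# Without cache: I assume each action re-runs the whole plan over T1
start = time.time()
spark.sql("select count(*) as total from T1_sdf").show()
print("uncached:", time.time() - start)

T1.cache()   # mark T1 for caching (lazy)
T1.count()   # run an action to actually materialize the cache

# With cache: does the same query now read the cached data instead?
start = time.time()
spark.sql("select count(*) as total from T1_sdf").show()
print("cached:", time.time() - start)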

KevinKim
  • Caching (in general) keeps the data in memory. Not having to re-load it saves time. – Gordon Linoff Feb 11 '17 at 16:38
  • @GordonLinoff That part I understand; I guess my question is how PySpark handles it. For example, `T1` is already a Spark DataFrame in memory (I am on a single machine). To perform SQL operations, I first have to create a view `T1_sdf`. So in my case I don't have to re-load the DataFrame `T1`, since it is already in memory; does this mean that cache does not help me at all? – KevinKim Feb 11 '17 at 16:47

0 Answers