
If I read a file in PySpark:

data = spark.read.csv("file.csv")

Then for the life of the Spark session, `data` is available in memory, correct? So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct? If yes, why do I need:

data.cache()
thebluephantom
Victor
  • I like to think of cache as a save point. Suppose you have chained many aggregations that take at least 5 minutes to finish: every call to `show()` will execute everything from scratch, but if you cache your DataFrame after the aggregation, later `show()` calls will return almost instantly. – Kafels May 14 '21 at 20:37
  • And you're thinking of a scenario where all the data fits in the cluster's memory. When it doesn't fit, the data is spilled to disk, so if you don't filter and cache the result, your application will move the data from disk to memory every time `show()` runs. – Kafels May 14 '21 at 20:43
  • Does this answer your question? [(Why) do we need to call cache or persist on a RDD](https://stackoverflow.com/questions/28981359/why-do-we-need-to-call-cache-or-persist-on-a-rdd) – pltc May 14 '21 at 20:46
  • So we cannot assign `show()` to a variable, e.g. `x = data.show()`? And then if we use `x` multiple times, the value is already cached in `x`, right? – Victor May 14 '21 at 21:08
  • @Victor `show()` only prints a sample of the data. I would recommend you read about Spark actions and transformations, especially lazy operations. – Kafels May 15 '21 at 00:14
  • has nothing to do with pyspark – thebluephantom May 15 '21 at 10:20

1 Answer


If I read a file in PySpark: data = spark.read.csv("file.csv"). Then for the life of the Spark session, `data` is available in memory, correct?

No. Because Spark evaluates lazily, this line does not load the data into memory. The data is only read when an action runs, which in your case is the first call to show().

So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct?

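No. The DataFrame is re-evaluated on each call to show(), so the source is read again each time. Caching the DataFrame prevents that re-evaluation: once an action has populated the cache, later actions read from the cache instead. Note that cache() is itself lazy and does nothing until the next action runs.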

Chris